Sprite 1984




Text File | 1992-07-01 | 204.5 KB | 6,369 lines

Log-Number: 31981
Subject: status of IOC_MAP ioctl?
Date: Wed, 01 Jan 92 18:49:46 PST
From: Mike Kupfer <kupfer>

The Sprite server is getting back GEN_INVALID_ARG when it tries to inform Allspice that it (the Sprite server) is about to map a file. As near as I can tell, this is because the IOC_MAP case in Fsio_FileIOControl has been if'd out... since 1989. It looks like the ioctl is no longer valid. If it were enabled, it would call Fsconsist_MappedConsistency, which has been ifdef'd into a no-op.

So is this simply a matter of dead code that should be flushed, or is this code that doesn't quite work right so it's turned off?

Also, we should fix Vm_MmapInt to pay attention to the return value from Fs_FileBeingMapped.

mike

Log-Number: 31982
Date: Thu, 2 Jan 92 11:26:05 PST
From: shirriff (Ken Shirriff)
Subject: Re: status of IOC_MAP ioctl?

IOC_MAP and Fsconsist_MappedConsistency are dead code, so they can be taken out. (The original plan was to ensure consistency of mapped files across multiple machines. However, since there wasn't anyone wanting to use this feature I didn't implement it.)

Ken

Log-Number: 31983
Date: Thu, 2 Jan 92 15:49:26 PST
From: shirriff (Ken Shirriff)
Subject: Strange pmake messages

If I interrupt a pmake with a ^C, I get a bunch of messages:

    JobCondPassSig: couldn't send signal 2 to process d1d1d: no such process
    JobInterrupt: couldn't send signal 2 to process 41d36: no such process

I don't think pmake used to do this. It's not really a problem, but it's kind of strange.

Ken

Log-Number: 31984
Subject: Re: Strange pmake messages
Date: Thu, 02 Jan 92 16:34:38 PST
From: Mike Kupfer <kupfer>

When I was working on the pmake hangs I fixed all the kill() statements to check for error returns. Some of the checks always print a warning message, others only print something if debugging is turned on. If people think the messages are a nuisance, I can fix pmake so that a warning is printed only if debugging is turned on.
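One way to decide when a failed kill() deserves a warning is to treat "no such process" as the expected case, since the job may simply have exited before the signal was sent. The following is a minimal sketch assuming POSIX kill() and errno semantics; signal_job, signal_reaped_child, and the result codes are hypothetical names, not taken from the pmake source.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes; not taken from the pmake source. */
enum job_signal_result { JOB_SIGNALED, JOB_GONE, JOB_ERROR };

/*
 * Send sig to pid.  ESRCH ("no such process") is treated as an expected
 * outcome and is reported only when debug is nonzero.  Any other failure
 * always produces a warning.
 */
enum job_signal_result
signal_job(pid_t pid, int sig, int debug)
{
    assert(sig >= 0);

    if (kill(pid, sig) == 0) {
        return JOB_SIGNALED;
    }
    if (errno == ESRCH) {
        if (debug) {
            fprintf(stderr, "signal_job: process %ld is already gone\n",
                    (long) pid);
        }
        return JOB_GONE;
    }
    fprintf(stderr, "signal_job: couldn't send signal %d to process %ld: %s\n",
            sig, (long) pid, strerror(errno));
    return JOB_ERROR;
}

/*
 * Demonstrates the benign case: the child has exited and been reaped
 * before the parent tries to interrupt it, so kill() fails with ESRCH.
 */
enum job_signal_result
signal_reaped_child(void)
{
    pid_t child = fork();

    if (child < 0) {
        return JOB_ERROR;
    }
    if (child == 0) {
        _exit(0);               /* child: exit immediately */
    }
    if (waitpid(child, NULL, 0) != child) {
        return JOB_ERROR;
    }
    return signal_job(child, SIGINT, 0);
}
```

With this split, an interrupt that arrives after a job has finished stays quiet unless debugging is on, while permission errors and invalid signal numbers are still reported.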
However, I would rather not spend the time to figure out when it is reasonable for kill() to fail (so that a warning is only printed when something unexpected happens). mike Log-Number: 31985 Date: Fri, 3 Jan 92 10:25:32 -0800 From: sullivan@postgres.Berkeley.EDU (Mark Sullivan) Subject: Sprite dies if process table overflows I ran an application program that forked N processes, where N was a command line argument. I corrupted the argument and the program tried to create 4000 processes. Sprite printed a message to the console about "couldn't find a free PCB" and went into the debugger. My program is fixed now, so this isn't a high priority bug as far as my work is concerned. Mark Log-Number: 31987 Subject: gremlin and R5 Date: Fri, 03 Jan 92 12:52:20 PST From: Mike Kupfer <kupfer> Gremlin doesn't work with R5 because R5 doesn't have all the fonts that it wants. Part of the problem may simply be one of aliasing (i.e., the fonts are there, but gremlin expects aliases that weren't set up for R5). However, there may also be fonts that were locally installed for R4 but not for R5, and gremlin might rely on some of those, too. mike Log-Number: 31999 Date: Fri, 10 Jan 92 00:54:48 PST From: shirriff (Ken Shirriff) Subject: X11R5 gremlin problem fixed I fixed two problems that prevented gremlin from working in X11R5. a) X11R5 doesn't have the font screen.r.12 which gremlin expected. I guess this font is no longer supported or something. I switched gremlin to use 6x12. There may be other programs that use screen.r.12; they will have to be changed too for X11R5. b) For some reason creating a window with CWDontPropagate set caused a BadValue (integer parameter out of range for operation) error from X. I don't know what DontPropagate does or why it caused an error, but gremlin seems to work without it. Any explanation would be welcome. 
Ken Log-Number: 32006 Date: Fri, 10 Jan 92 19:16:28 PST From: bsw!adam@uunet.UU.NET (Adam de Boor) Subject: X11R5 gremlin problem fixed I believe the DontPropagate mask for a window can keep pointer and keyboard events from being delivered higher up the window heirarchy ("if I don't want it, no one else can have it either"), but it can only be set in an XChangeWindowAttributes call, not when the window is being created, for whatever reason (you'd have to look in the protocol spec to find out). a Log-Number: 32000 Date: Fri, 10 Jan 92 13:16:00 PST From: ouster (John Ousterhout) Subject: New gremlin Unfortunately, the newly-installed gremlin, which avoids using "screen.r.12" or some such font that isn't available in R5, looks horrible to me (text unreadably small) and some of the other fonts appear wrong too (smaller than old gremlin), so that there is no longer a proper WYSIWYG effect. I've backed out the old version from /X11/R4/cmds.sun4.old so I can get my stuff ready for the X Conference. If you need a font for R5, why not just copy it over from the R4 font directory? The "screen" family isn't part of X proper (I copied it from X10 to X11, and from R1 to R2, R2 to R3, and R3 to R4). Log-Number: 32001 Date: Fri, 10 Jan 92 13:51:02 PST From: shirriff (Ken Shirriff) Subject: Re: New gremlin I started copying the screen fonts from R4 to R5, but it turned out to be horribly complex. They've changed the format of the fonts, so you have to run them all through a converter. Then, setting up the imake files correctly is also a nasty task. I think we would be much better moving gremlin forward to use the new fonts, rather than continually trying to keep the obsolete fonts. (Sorry about installing a new gremlin just before the conference; I didn't think of that.) Ken Log-Number: 31988 Subject: race condition in RPC code? Date: Sun, 05 Jan 92 17:09:45 PST From: Mike Kupfer <kupfer> I just tracked down a race in the Sprite server between the network and timer. 
I think it's a potential problem for the native Sprite kernel, but I'm not sure. Maybe it's only a problem with multiprocessors? Or maybe it's not a problem at all, because some of these routines are called at interrupt level, rather than by dedicated processes. You tell me.

Here's the race (with time going down). Note that in the Sprite server, the timer queue is serviced by waking up a separate process that finds the "expired" elements in the queue and calls their routines.

requesting process          timer process               net input process
------------------          -------------               -----------------
Send RPC request.
Put channel in timeout
queue.
Block on channel's
condition variable and
release channel's master
lock.
                            RPC times out.
                            Get timer master lock.      Get RPC response.
                            Take channel off of         Obtain channel's master
                            timeout queue.              lock.
                            Release timer master
                            lock.
                                                        Take channel off timeout
                                                        queue (fails silently).
                            Call Rpc_Timeout.
                            Block on channel's master
                            lock.
                                                        Broadcast on channel's
                                                        condition variable.
                                                        Release channel's master
                                                        lock.
Get channel's master lock.
Process RPC response, put
channel back on timeout
queue (the response was
an ACK?).
Block on channel's
condition variable &
release its master lock.
                            Obtain channel's master
                            lock.
                            Broadcast on channel's
                            cond. variable and
                            release master lock.
Obtain channel's master
lock.
Resend RPC request and
put channel on timeout
queue.
----------

This second call to put the channel on the timeout queue leads to disaster, because it's already on the queue and the timer and list packages can't deal with it.

Anyway, the bug in all this is that RpcClientDispatch doesn't check the return code from Timer_DescheduleRoutine. From looking at the code, my guess is that this check should be done much earlier (e.g., right after the "Discover our own Sprite ID" code), and RpcClientDispatch should bail out if Timer_DescheduleRoutine returns FALSE. Does anyone think there will be problems doing it this way?

thanks,
mike

Log-Number: 32008
Subject: Re: race condition in RPC code?
Date: Sun, 12 Jan 92 20:11:52 PST From: Mike Kupfer <kupfer> I said > From looking at the > code, my guess is that this check should be done much earlier (e.g., > right after the "Discover our own Sprite ID" code), and > RpcClientDispatch should bail out if Timer_DescheduleRoutine returns > FALSE. This is false, because you don't want to deschedule the timer element until you've made sure you aren't going to throw the packet away for some reason (e.g., bogus ID number). mike Log-Number: 31991 Date: Tue, 7 Jan 92 15:07:56 PST From: shirriff (Ken Shirriff) Subject: Kernel build problem If your mainHook.c for the sun4c doesn't have main_PrintInitRoutines = FALSE; in Main_InitVars(), the kernel will immediately crash on execution. (I found this out since my kernels wouldn't build but John's would. The reason was I didn't have the line. My kernels used to work, so I don't know why this line suddenly became significant.) Ken Log-Number: 31993 Date: Wed, 8 Jan 92 01:27:27 PST From: shirriff (Ken Shirriff) Subject: Profiling bugs fixed I've fixed a couple bugs in profiling: a) The kernel wouldn't recognize profiled sun programs. b) The pc arithmetic would overflow, so if you had a pc>0x10000, it would wrap around and the profiler would think the program was somewhere else. These fixes will be in the next kernel. Ken Log-Number: 31994 Date: Thu, 9 Jan 92 08:40:39 PST From: ouster (John Ousterhout) Subject: Allspice reboot I rebooted Allspice this morning because it suddenly stopped servicing clients. It appeared OK from the console, and I realized after I rebooted it that this was probably just a case of the timer interrupt lossage and that I probably should have tried L1-A and continue before rebooting. Sorry... -John- Log-Number: 31995 Subject: potential hang in RPC channel allocation code Date: Thu, 09 Jan 92 14:36:12 PST From: Mike Kupfer <kupfer> I found this bug in the Sprite server; it also appears to be present in the kernel. 
If there are no free channels, RpcChanAlloc waits for the condition variable freeChannels. RpcChanFree does the broadcast on freeChannels, but only if numFreeChannels is exactly equal to one. Unfortunately, RpcChanFree is not the only routine that increments numFreeChannels. RpcChanClose can also increment numFreeChannels, and it doesn't do any broadcast on freeChannels. This can cause a process to get stuck in RpcChanAlloc. If that process holds a sufficient number of important locks, things can grind to a halt.

Either (1) RpcChanClose should check numFreeChannels and do the broadcast if necessary (perhaps by calling an internal version of RpcChanFree), or (2) RpcChanFree should change its test from "== 1" to ">= 1".

mike

Log-Number: 31996
Subject: dumps failed last night
Date: Thu, 09 Jan 92 15:15:00 PST
From: Mike Kupfer <kupfer>

The dumps failed last night, with the following messages in hijack's syslog:

    Warning: Exabyte 8500 at SCSI#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0xdd 0xb
    Additional Sense Code: 0x9 Additional Sense Code Qualifier: 0x0
    EXB8500 Fault Symptom Code = 0xae
    Warning: Exabyte 8500 at SCSI#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0xdd 0xb
    Additional Sense Code: 0x9 Additional Sense Code Qualifier: 0x0
    EXB8500 Fault Symptom Code = 0xae

I couldn't find an Exabyte reference manual, so I don't know what these messages mean. I notice that the tape in question (#193) is a "Hi8" tape, which is not the same type as most of our other dump tapes.

mike

Log-Number: 31997
Date: Thu, 9 Jan 92 17:00:02 PST
From: shirriff (Ken Shirriff)
Subject: X access control disabled

I've installed a new sun4 X11R5 server that fixes Mike's problem with access control. The access control mode is defined in the file server/include/site.h.

Ken

Log-Number: 32005
Subject: new Emacs is broken
Date: Fri, 10 Jan 92 17:53:06 PST
From: Mike Kupfer <kupfer>

... at least on a sun4. The compilation subwindow stuff (e.g.,
"grep") hangs, rather than detect that the subprocess has finished. mike Log-Number: 32007 Date: Sat, 11 Jan 92 23:51:05 PST From: mottsmth (Jim Mott-Smith) Subject: new Emacs is broken Emacs is fixed. The problem was due to a change made in the fcntl module which was picked up when I relinked emacs yesterday. I backed out the modification, rebuilt libc.a and relinked emacs. Jhh, the questionable version is fcntl.c.bak. -- Jim M-S Log-Number: 32010 Date: Mon, 13 Jan 92 11:20:19 PST From: mottsmth (Jim Mott-Smith) Subject: fcntl Emacs does a fcntl(fd, F_SETFL, O_NDELAY) to set non-blocking I/O. This used to become a Fs_IOControl with the IOC_SET_BITS parameter. In the new version of fcntl it uses IOC_SET_FLAGS, which doesn't seem to do the job. A read on the channel by emacs hangs waiting for data. -- Jim M-S Log-Number: 32012 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 13 Jan 1992 16:15:59 PST Subject: volatile and mips Our sprite.h defines volatile to be nothing if __STDC__ is not set. The mips compiler does not define __STDC__, but it does understand volatile. If we change sprite.h then we can probably compile the net and dev modules with optimization turned on. John Log-Number: 32014 Subject: R5 startup unreliable Date: Tue, 14 Jan 92 14:27:10 PST From: Mike Kupfer <kupfer> I have been having random problems starting up X11 R5 on Sage. xinit complains xinit: invalid argument (errno 22): Server error. and then exits. However, the server itself does in fact start up, so I have to do an L1-k to get back to the shell and kill it. When I try to start X a second time, everything works okay. mike Log-Number: 32015 Subject: two raid1 crashes Date: Tue, 14 Jan 92 14:56:22 PST From: Mike Kupfer <kupfer> The first one everyone knows about (/r3 had a bad block), but I don't think a message about it made it into the log. The second one was a mystery crash just a few minutes ago. The display was dark, so we couldn't read any console messages. 
Also, raid1 didn't respond to "kmsg -v" (though someone may have hit the reset switch by the time we tried that), so we just rebooted. By the way, when raid1 rebooted after the second crash, there were a bunch of messages of the form "DMA space already valid at xxx". Does anyone know if this is something to be worried about? mike Log-Number: 32016 Date: Tue, 14 Jan 92 14:58:50 PST From: shirriff (Ken Shirriff) Subject: Re: two raid1 crashes Raid1 might have crashed a few minutes ago because I was doing some Ultranet things. (But that might not be the cause.) Ken Log-Number: 32017 Date: Tue, 14 Jan 92 17:27:58 PST From: mottsmth (Jim Mott-Smith) Subject: /usr/sww/bin/perl seg faults The perl script classgrid runs successfully with the local Perl, but dies with a seg fault using /usr/sww/bin/Perl. -- Jim M-S Log-Number: 32018 Date: Tue, 14 Jan 92 17:57:34 PST From: shirriff (Ken Shirriff) Subject: Bad Ultranet board in fenugreek The Ultranet board in fenugreek times out on boot. According to jhh, this is likely a problem with the board and should be fixed. Log-Number: 32019 Date: Tue, 14 Jan 92 17:58:20 PST From: shirriff (Ken Shirriff) Subject: Sun3 crashes About every 4 hours catnip crashes with "Fatal Error: Current process is NIL". This may be a sun3 problem or an ultranet problem. Ken Log-Number: 32021 Subject: raid1 reboot: MachHandleWindowUnderflow Date: Fri, 17 Jan 92 19:54:26 PST From: Mike Kupfer <kupfer> Raid1 was killing some of my commands with MachHandleWindowUnderflow: killing process. so I rebooted it. mike Log-Number: 32022 Subject: Fs_GetAttrStream can return wrong size? Date: Sat, 18 Jan 92 20:16:18 PST From: Mike Kupfer <kupfer> I have a 32MB file that the Sprite server created. Unfortunately, the Sprite server seems to get into a state where Fs_GetAttrStream claims that it has a size of 0. (Restarting the server makes the problem go away.) 
I suspect the problem has something to do with recovery, because creating the file often causes Mach to hang for long enough that the Sprite server has to do recovery with raid1. Other attributes (owner, last access time, permissions) seem to be okay, so I don't understand why just the size would be wrong. Has anyone seen something like this in native Sprite? The file is mapped (i.e., being grown by Fs_PageWrite) when the Sprite server does recovery with raid1, in case that's relevant. mike Log-Number: 32024 Subject: two "bad stream type" crashes Date: Sun, 19 Jan 92 19:05:44 PST From: Mike Kupfer <kupfer> Sage crashed twice recently with Fs_RetSegPtr, bad stream type <some large negative number> The panic actually happens in Fs_GetSegPtr; it's a typo in the code. I ignored the first crash but tried to debug the second one. Basically, the file handle passed to Fs_GetSegPtr is garbage. My guess is that a file handle for a text segment is getting freed and the VM module isn't getting notified, so the VM segment has a dangling pointer. The segment in question in the second crash had an objFileName of /users/kupfer/cmds.ind/msgchk, which is a symbolic link to /usr/sww/bin/msgchk, so I wonder if this is somehow related to the execution of nfsmounted files. If we can't find this problem by inspection, one idea I had was to change REMOVE_HANDLE() to call a VM debug routine to verify that the handle isn't pointed to by any segments. mike P.S. 
Here's the stack backtrace:

#0 panic (__builtin_va_alist=-167567283) (sysPrintf.c line 220)
#1 0xf603210c in Fs_GetSegPtr () (fsStreamOps.c line 944)
#2 0xf60c5198 in CleanSegment (segPtr=(struct Vm_Segment *) 0xf619f22c) (vmSeg.c line 886)
#3 0xf60c502c in DeleteSeg (segPtr=(struct Vm_Segment *) 0xf619f22c) (vmSeg.c line 826)
#4 0xf60c48cc in Vm_SegmentNew (type=3, filePtr=(struct Fs_Stream *) 0xffffffff, fileAddr=0, numPages=1, offset=122879, procPtr=(struct Proc_ControlBlock *) 0xf6438990) (vmSeg.c line 466)
#5 0xf608dec0 in SetupVM (procPtr=(struct Proc_ControlBlock *) 0xf6438990, objInfoPtr=(ProcObjInfo *) 0xf61a7208, codeFilePtr=(struct Fs_Stream *) 0xf644b9f8, usedFile=1, codeSegPtrPtr=(struct Vm_Segment **) 0xf8155c24, execInfoPtr=(Vm_ExecInfo *) 0xf61a772c) (procExec.c line 1586)
#6 0xf608d5f4 in DoExec (fileName=(char *) 0xffffffff, userArgsPtr=(UserArgs *) 0xf8155dc8, encapPtrPtr=(ExecEncapState **) 0xffffffff, debugMe=0) (procExec.c line 1185)
#7 0xf608c8a0 in Proc_Exec (fileName=(char *) 0x57088, argPtrArray=(char **) 0x575b8, envPtrArray=(char **) 0x1dfffa18, debugMe=0, host=0) (procExec.c line 390)
#8 0xf608c704 in Proc_ExecEnv (fileName=(char *) 0x57088, argPtrArray=(char **) 0x575b8, envPtrArray=(char **) 0x1dfffa18, debugMe=0) (procExec.c line 258)
#9 0xf601286c in MachFetchArgsEnd ()

Log-Number: 32025
Date: Sun, 19 Jan 92 21:42:28 PST
From: mottsmth (Jim Mott-Smith)
Subject: vfscanf

Sprite's handling of the following is inconsistent with both SunOS and Ultrix. If I read K&R right, Sprite is wrong.

Sprite says: cnt=2, i=3, j=16777214 on a sun4 and cnt=2, i=3, j=-256 on a decstation. (Presumably just byte order differences).
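The cnt=2 results appear consistent with a sscanf that treats an empty %[ match as a successful conversion and stores a lone NUL byte through &j: zeroing the first byte of -2 (0xFFFFFFFE) gives 16777214 on a big-endian sun4 and -256 on a little-endian DECstation. Under ISO C, a %[ directive that matches no characters is a matching failure, so the count should stop at 1. The sketch below shows that behavior with a properly typed character buffer; parse_int_and_word is a hypothetical helper, not Sprite code, and the statement that the buffer is left untouched reflects common C libraries rather than anything measured on Sprite.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse "<int> <run of characters other than '4'>" from input.
 * Returns sscanf's count of successful conversions.  word must have
 * room for at least 8 bytes; if %[ matches nothing, the conversion
 * fails and common C libraries store nothing in word.
 */
int
parse_int_and_word(const char *input, int *value, char word[8])
{
    assert(input != NULL && value != NULL && word != NULL);

    /* %7[^4] stores at most 7 characters plus the terminating NUL. */
    return sscanf(input, "%d %7[^4]", value, word);
}
```

Called on "3 4", the helper converts the integer and then fails on the scanset because the next character is '4'. Called on "3 ab4", both conversions succeed and word holds "ab".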
The others say: cnt=1, i=3, j=-2

-- Jim M-S

======================================
#include <stdio.h>

int
main()
{
    static char buf[10] = "3 4";
    int n;
    int i = -1;
    int j = -2;

    n = sscanf(buf, "%d %[^4]", &i, &j);
    printf("cnt=%d, i=%d, j=%d\n", n, i, j);
}

Log-Number: 32029
Subject: file attributes not updated correctly
Date: Tue, 21 Jan 92 13:04:37 PST
From: Mike Kupfer <kupfer>

I've just run into another instance of file attributes being cached and not correctly updated. If host A caches the attributes for program myProg and host B then changes myProg to be setuid to root, A will not get the new permissions.

I understand that attribute consistency is something of a swamp, but there are two specific things that could be done to make the problem less burdensome.

First, if host A does a "get attributes" on the program myProg (e.g., the user does "ls myProg" to verify that the permissions are right), then the cached attributes should get updated. Currently this doesn't happen.

    sage% ./ls -ld tmp/foo
    d--------- 2 kupfer 512 Jan 20 12:24 tmp/foo
    sage% ./ls tmp/foo
    tmp/foo unreadable
    sage% ./ls -l ls
    -rwsrwxr-x 1 root 78930 Nov 20 00:25 ls
    sage% migrate ./ls -l tmp/foo
    total 28
    -rw-r--r-- 1 kupfer 28038 Jan 20 12:24 Todo
    sage% ./ls tmp/foo
    tmp/foo unreadable

Second, there should be a command (or an option to fscmd) that the user can run to force attribute consistency.

Brent, do you have any pointers to existing code that would help make either of these two suggestions work?

thanks,
mike

Log-Number: 32030
Date: Tue, 21 Jan 92 13:27:34 PST
From: pmchen (Peter M. Chen)
Subject: pause during cleaning

I think this is a well-known problem, but I have some measurements which may be enlightening. I have some measurements of I/O's that took 71, 81, 96, and 114 SECONDS to complete (I think this happened during cleaning). This was on mustard, a ds5000 (I think it was running 1.109).
The files were on a local disk connected to mustard; there was only one process issuing I/O to that disk. Is there anything that can be done to make cleaning less disruptive?

Pete

Log-Number: 32031
Date: Tue, 21 Jan 92 20:00:49 PST
From: mottsmth (Jim Mott-Smith)
Subject: Lust died with 'packet too large' message

Lust died for no apparent reason with

    OutputPacket: packet too large (4174)

I couldn't get anywhere with Dill so I rebooted Lust.

-- Jim M-S

Log-Number: 32033
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Wed, 22 Jan 1992 11:48:42 PST
Subject: negative minor number in handle

Hijack crashed running 1.109 when it tried to clean a dirty block from a file. The fileNum field in the Fscache_Block structure for the block was set to 912, but the minor number in the Fs_HandleHeader structure associated with the Fscache_FileInfo was set to -912, so it panicked. Looks to me like somehow 912 got changed to -912, which requires more than changing a single bit. I don't think we ever do arithmetic on minor numbers, so the odds of fixing this one are low.

John

Log-Number: 32034
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Wed, 22 Jan 1992 11:57:43 PST
Subject: Re: negative minor number in handle

I've got some more information on this bug. When Hijack tried to write back the file, the rpc timed out because Lust was down. When Lust rebooted and recovered, the re-open of the file in question failed due to a version mismatch. Hijack subsequently crashed trying to write the file back. This shouldn't have happened because Hijack should have cleaned up the file's state once it couldn't be reopened. Perhaps there is a synchronization problem between the block cleaner and the cleanup of a failed reopen?
John Log-Number: 32035 Date: Wed, 22 Jan 92 12:01:18 PST From: ouster (John Ousterhout) Subject: More on trashed mailbox Incidentally, my mailbox seems to have a bunch of NULLs in it now that suddenly appeared around the same time that a bunch of my messages disappeared. -John- Log-Number: 32038 From: mgbaker (Mary Gray Baker) Subject: timer queue garbaged Date: Wed, 22 Jan 92 21:38:49 PST Jaywalk just died while trying to open a directory. It was trying to insert a new element into the timer queue, but it tried to insert the element in front of another one that was garbage. #0 panic (__builtin_va_alist=-167540547) (sysPrintf.c line 228) sysPrintf.c: no such file or directory. #1 0xf6038d50 in MachHandleTrap (trapType=112, pcValue=(char *) 0xf60dc4fc "\320\002\240\f\200\242@\b\006\200", trapPsr=286265284) (sun4c.md/machCode.c line 1854) #2 0xf603aff4 in MachReturnFromTrap () #3 0xf60dc4d0 in Timer_ScheduleRoutine (newElementPtr=(Timer_QueueElement *) 0xf626d4c0, interval=1) (timerQueue.c line 357) #4 0xf60c5e14 in RpcDoCall (serverID=1, chanPtr=(struct RpcClientChannel *) 0xf626d4b0, storagePtr=(struct Rpc_Storage *) 0xf8239798, command=7, srvBootIDPtr=(unsigned int *) 0xf82392ac, notActivePtr=(ClientData) 0xf82392a4, fastBootPtr=(ClientData) 0xf823929c) (rpcClient.c line 189) #5 0xf60c4700 in Rpc_Call (serverID=1, command=7, storagePtr=(struct Rpc_Storage *) 0xf8239798) (rpcCall.c line 204) #6 0xf6086e44 in FsrmtOpen (prefixHandle=(struct Fs_HandleHeader *) 0xf648e0b8, relativeName=(char *) 0xf643fae8 "kernel/mgbaker/printOn", argsPtr=(char *) 0xf8239908 "", resultsPtr=(char *) 0xf82398d0 "\370#\2320\366\f\367\314\370#\231p", newNameInfoPtrPtr=(struct Fs_RedirectInfo **) 0xf823982c) (fsrmtDomain.c line 301) #7 0xf6082c90 in Fsprefix_LookupOperation (fileName=(char *) 0xf643fadc "/sprite/src/kernel/mgbaker/printOn", operation=2, follow=4096, argsPtr=(char *) 0xf8239908 "", resultsPtr=(char *) 0xf82398d0 "\370#\2320\366\f\367\314\370#\231p", nameInfoPtr=(struct 
Fs_NameInfo *) 0xf6463450) (fsprefixOps.c line 169)
#8 0xf605450c in Fs_Open (...) (...)
#9 0xf6054a58 in Fs_ChangeDir (...) (...)
#10 0xf60641f0 in Fs_ChangeDirStub (...) (...)
#11 0xf603baec in MachFetchArgsEnd ()

Mary

Log-Number: 32043
Subject: compat egrep botches "or" syntax
Date: Sun, 26 Jan 92 15:59:02 PST
From: Mike Kupfer <kupfer>

egrep is supposed to recognize "stringA|stringB" as matching stringA or stringB. However, the compat version of egrep requires that the vertical bar be escaped. Thus

    sage% which egrep
    /sprite/cmds.compat/egrep
    sage% echo foo | egrep "foo|bar"
    sage% echo foo | egrep "foo\|bar"
    foo
    sage% echo foo | /sprite/cmds/egrep "foo|bar"
    foo
    sage% echo foo | /sprite/cmds/egrep "foo\|bar"
    sage%

mike

Log-Number: 32044
From: dlong@cats.UCSC.EDU
Date: Mon, 27 Jan 92 13:11:22 -0800
Subject: LFS panic

I'm running the 106 kernel over here, and just started getting the following panic: "LfsOkToRead read from clean segment". Is there any chance a newer kernel will solve the problem? By the way, almost all the partitions are LFS, including / and /sprite. I'm not sure which one is causing the panic.

dl

Log-Number: 32157
Subject: /user1/jamin is messed up
Date: Wed, 26 Feb 92 17:37:23 PST
From: Mike Kupfer <kupfer>

    sage% ls -ldg jamin
    drwxrwxr-x 3 jamin oldstaff 4608 Feb 14 00:03 jamin/
    sage% ls -lga jamin
    total 0
    sage%

Log-Number: 32158
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Wed, 26 Feb 1992 17:50:19 PST
Subject: Re: /user1/jamin is messed up

Sorry this didn't make it to bugs earlier. Jamin's home directory got overwritten with what looks to be part of John O's mail. I didn't dare fiddle with it for fear of losing John's mail box for the umpteenth time. The overwriting is undoubtedly due to the bug that causes those "read from clean segment" messages. Eventually the segment gets reused even though it isn't clean. I assume that the directory was in one of these segments, which was subsequently used to store one of John's files.
We need to fix this bug. Perhaps we could put in a short-term fix that would mark the supposedly clean segment as actually being dirty? We also need a program ala fscheck that will fix up an lfs that has gotten out of wack. John Log-Number: 32178 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 3 Mar 1992 10:12:15 PST Subject: more lfs "read from clean segment" problems Lust died due to a read from clean segment. Undoubtedly whatever it was trying to read will be overwritten when the segment is reused. I believe that the file system was /sprite/src. Be on the lookout for trashed files and/or directories. John Log-Number: 32048 Date: Tue, 28 Jan 92 13:14:55 -0800 From: dlong@cse.ucsc.edu Subject: more on LFS panic I'm using the 1.109 kernel now, and I'm getting a different panic message. It's a "bad descriptor magic number" on my /sprite partition. Accesses to files under certain directories, namely /sprite/spool, seem to trigger the panic. You guys have it setup so that you can run kgdb on allspice from a non-sprite machine, right? How hard is that to setup? That would really help a lot. dl Log-Number: 32050 Subject: "man -k" is case sensitive Date: Tue, 28 Jan 92 15:25:46 PST From: Mike Kupfer <kupfer> It would be nice if "man -k" ignored case, so that "man -k postscript" would match entries like f2ps (x) - Fig to Postscript translator psnup (local) - Insert N-up code into Postscript files mike Log-Number: 32052 From: Fred Douglis <douglis@MITL.COM> Subject: getting X running Date: Wed, 29 Jan 92 17:17:19 -0500 any idea why i'd get "/dev/mouse: no such device" when I try to access it on sprite.mitl.com? Works fine on ucb sprite. I tried recreating /dev/mouse.ds3100 from scratch using fsmakedev, but that made no difference. Accessing /dev/mouse.ds3100 directly also made no difference. It's as though the kernel I'm running (JHH.2362) doesn't recognize the mouse (12,1) as a device, which is hard to believe. 
Fred Log-Number: 32053 Subject: mysterious pmake failures not exterminated Date: Wed, 29 Jan 92 23:11:24 PST From: Mike Kupfer <kupfer> I'm still seeing occasional inexplicable pmake failures on DECstations. The job chugs along like normal and then for no apparent reason a compilation dies with no message other than "Error code 16". --- ds3100.md/vmFsCache.o --- rm -f ds3100.md/vmFsCache.o Rmt_Done(host=57, priority=1) called. Process 70926 exited. *** Error code 16 --- ds3100.md/vmMsgQueue.o --- Host 57 is clove. "rup clove" shows HOST TYPE STATUS UP/DOWN OFFICE LOAD IDLE KERNEL clove* ds5000 avail 6+09:09 E 508-5 0.00 0+01:56 1.109 "ls -l" of other .o's shows that the failure happened around an hour ago. So (a) it doesn't appear related to rebooting clove; (b) it doesn't appear to be related to eviction; (c) it doesn't appear to be a problem with non-standard kernels--both piracy (the home machine) and clove are running 1.109. I'll try to see if there is anything interesting in clove's syslog... mike Log-Number: 32054 Subject: Re: mysterious pmake failures not exterminated Date: Thu, 30 Jan 92 13:10:11 PST From: Mike Kupfer <kupfer> There are a couple related messages in the Sprite log. One, there were some similar problems with jobs mysteriously dying with error code 1. This problem was apparently tracked down and fixed. Two, there were some problems with the mysterious error code 16 in January and October 1990, but nobody ever gave a complete explanation for the problem, and there's no indication it was ever fixed. I asked Ann to send me information from clove's syslog; her reply is below. Hypothesis #1 is that the problem is related to doing recovery with a file server. I checked piracy's syslog, and it, too, had gone through recovery with raid1 during the time in question, though as with clove, there's no good indication of what time this all happened. 
Hypothesis #2 is that the remote machine for some reason thinks that the process on the home machine had died, even if it hasn't. Unfortunately, the error message in ProcRemoteWait doesn't give the peer process ID, so this is hard to confirm. mike -- Date: Thu, 30 Jan 92 11:38:52 PST >From: alc (Ann L. Chervenak) Message-Id: <9201301938.AA211257@sprite.Berkeley.EDU> To: kupfer Subject: Re: could you check clove's syslog? Here is what was in the syslog windo after 19:50 last night and before 03:58 this morning: ProcRemoteWait killing process e3934: home node's copy died. open of "/r4" waiting for recovery Fsprefix_HandleClose deleting "/r4" Broadcasting for server of "/r4" Importing "/r4" from raid1 open of "/r3/jclee/traces/tracelist.do.es.na.to.eq.ma.sp.xl" waiting for recovery Fsprefix_HandleClose deleting "/r3" Broadcasting for server of "/r3" Importing "/r3" from raid1 LE ethernet: Too many collisions. LE ethernet: Too many collisions. Log-Number: 32111 Subject: more on pmake and "Error Code 16" Date: Wed, 12 Feb 92 19:28:37 PST From: Mike Kupfer <kupfer> I ran into our friend the mysterious Error Code 16 again last night and this afternoon, so I've made up a private kernel that has additional printf's to track down where the error is coming from. As near as I can tell, it's because ProcRemoteWait thinks that the peer process on the home node has disappeared, but I don't know why it thinks that. Unfortunately, this kernel will have to be fairly widely used for the printf's to do much good. Should I just check in the changes and wait until we put out a new kernel? mike Log-Number: 32058 Subject: stuck timer queue eventually killed Allspice Date: Fri, 31 Jan 92 18:27:34 PST From: Mike Kupfer <kupfer> Allspice got a stuck timer queue. I dropped it into the monitor and continued it. It started to go through recovery and then stopped stone dead (for no apparent reason). I reset it and rebooted. 
mike Log-Number: 32065 Date: Sun, 2 Feb 92 17:55:24 PST From: mani (Mani Varadarajan) Subject: more on cory sprite failure (king) as i reported before, rpcs to king hang soon after allspice reboots. the last message in king's syslog is "migd: write to global daemon failed". i have to reboot king, but this doesn't fix the problem on the client aix. it also needs to be rebooted. mani Log-Number: 32126 Date: Sun, 16 Feb 92 21:59:03 PST From: mani@zabriskie.Berkeley.EDU (Mani Varadarajan) Subject: aix hangs due to reference to /r3 i did an ``ls -ld'' in /users locally (cory sprite), and aix hung, with the syslog message ``Contacting server 77 for "/r3" prefix'', since some of the old accounts still are symbolically linked to /r3. this never times out. this also used to happen whenever /scratch1 (a nonexistent disk) was referenced, but jim took out all references to that when he updated the sprite here. now that raid1 is no longer in service, i guess these references also should removed, in lieu of a more lasting solution. is there a quick fix for this? mani Log-Number: 32061 Date: Sat, 1 Feb 92 15:11:41 PST From: shirriff (Ken Shirriff) Subject: Allspice name server problem Allspice has some weird name server problem after being rebooted. Anything that uses gethostbyname ends up hanging and waiting for the name server to respond. There was some problem before the reboot, too, because I couldn't ping agate or melvyl from sprite, but they were accessible from other machines. Ken Log-Number: 32063 Date: Sat, 1 Feb 92 16:31:57 PST From: shirriff (Ken Shirriff) Subject: More about name server problems I've looked into the problems some more. On a normal sprite machine, the name server works, but pings don't work. For instance, if I do "ping agate", it resolves to 128.?.?.?, but the ping wedges in the recvfrom. However, if I restart the ipServer, the name server fails. Then a "ping agate" never even gets resolved to 128.?.?.?. 
Looking at gethostbyname, the problem is that all the name server requests time out. These are the same requests that worked before I restarted the ipServer. So I don't know what's going on.

Ken

Log-Number: 32068
Date: Mon, 3 Feb 92 11:42:15 PST
From: pmchen (Peter M. Chen)
Subject: arson hanging pmake

Arson was hanging a pmake of mine, so I disabled migration on arson (migcmd -I none). Here is what was going on on arson at the time (don't know why it was listed as available for migration with such a high CPU load). As far as I could tell, it wasn't just slow (I waited for a minute or two).

Pete

arson% top
USER    PID    %CPU  %MEM  SIZE  RSS  STATE  TIME   PR  COMMAND
jclee   43c45  91.1  0.8   524   248  READY  19:46  <   csim -s0 -p -i1k -d1k ...
jclee   3c4a   83.5  0.7   520   244  READY  19:57  <   csim -s0 -p -i1k -d1k ...
pmchen  33c52  7.2   2.5   976   820  READY  0:01       ccom -EL -Xg2 -O1 ...
pmchen  23c50  2.0   0.8   280   276  READY  0:01       -csh
root    13c4d  1.2   0.5   176   168  RWAIT  0:00       rlogind
root    13c1b  1.0   1.6   588   540  RWAIT  2:07       /sprite/daemons/ipServer
pmchen  23c5e  0.3   0.5   280   160  READY  0:00       -csh
pmchen  23c5d  0.3   0.5   244   164  RUN    0:00       ps -au
pmchen  13c5c  0.2   0.4   124   116  RWAIT  0:01       cc -c -g -DSPRITE -o ...

Log-Number: 32069
From: Fred Douglis <douglis@MITL.COM>
Subject: core leak?
Date: Tue, 04 Feb 92 10:12:24 -0500

My sprite machine crashed overnight with an out-of-memory error. Nothing interesting was going on, or should have been, and it had been up for only a half a day. There was a message about cleaning before the panic, but no way of knowing how much time passed. Are there known core leaks (relating to segment cleaning, for example)?

Fred

Log-Number: 32070
Date: Tue, 4 Feb 92 12:26:31 PST
From: bmiller (Bob Miller)
Subject: printer problem

There's been an overall problem with some new printer software, but can someone check to see if it's not Sprite hanging up lw533? Here's the message I'm getting...
subversion.Berkeley.EDU: waiting for queue to be enabled on shallot Bob Log-Number: 32071 Date: Tue, 4 Feb 92 16:39:28 PST From: bmiller (Bob Miller) Subject: more on printer problem... print jobs can be sent from shallot (a non-Sprite machine) and they print normally. I cannot print from subversion and Prof. Culler tried to send a print job from cardamom. Both jobs are sitting in the queue. Lpq on shallot shows no entries. [5-Mar-1992 (from the Sprite meeting): the fix here is to remove the lock file in the spool directory. -mdk] Log-Number: 32074 From: johnw (John Wawrzynek) Subject: xserver problem Date: Wed, 05 Feb 92 15:13:34 PST It seems that no X client can open a window on gluttony: Xlib: connection to "gluttony:0.0" refused by server Xlib: Client is not authorized to connect to Server Error: Can't open display: gluttony:0 This is after typing "xhost +" at the console: all hosts being allowed (access control disabled). /ultrix/cmds.ds3100/xhost: must be on local machine to enable or disable access control. Thanks. Log-Number: 32078 Date: Wed, 5 Feb 92 21:51:47 PST From: shirriff (Ken Shirriff) Subject: out of control telnet I found the following telnet out of control on allspice: ddgarcia 70e50 74.6 0.1 168 104 READY2680:56 telnet clove I think telnet has done this before. I took a stack trace, but I couldn't figure out the problem: #0 0x126c0 in ioctl (fd=-209904, request=469758588, buf=(char *) 0x0) (ioctl.c line 478) #1 0xae70 in inet_addr (...) (...) #2 0x85d4 in wontoption () #3 0x5f84 in netflush () #4 0xaf2c in fflush (...) (...) #5 0x1bfff590 in ?? () #6 0x12a20 in ioctl (fd=294096, request=35, buf=(char *) 0xf60a142c <Address 0xf60a142c out of bounds>) (ioctl.c line 478) #7 0x10800 in bind (...) (...) #8 0xb29c in bcopy (...) (...) #9 0x8b4c in telrcv () #10 0x121d8 in ioctl (fd=0, request=8192, buf=(char *) 0x0) (ioctl.c line 254) #11 0xd0c8 in res_send (...) (...) #12 0x4760 in ?? 
() #13 0x4e24 in tn () Ken Log-Number: 32080 Date: Thu, 6 Feb 92 13:33:18 -0800 From: dlong@cats.UCSC.EDU (Dean R. E. Long) Subject: Re: Security problem tftp logging is broke, so if someone tries to grab /etc/passwd, you'll never know. The problem is tftpd changes its uid to "guest", so it can no longer append to the log file. The solution I use is to open the log file while tftpd is running as root, and keep it open. dl Log-Number: 32081 Subject: potentially lost mail Date: Thu, 06 Feb 92 13:51:52 PST From: Mike Kupfer <kupfer> I just now tried to resort my inbox and got told that I had a dozen messages that consisted solely of ASCII nulls. It's possible that any mail you sent me while I was gone got nuked. I suspect that this is related to the fact that Piracy died while in the middle of reading mail, so I tried again on arson (which is the machine I'm on now). So, if there's mail that you sent Sunday or later and you think I should see it, please resend it (assuming you still have a copy). thanks, mike Log-Number: 32082 Date: Thu, 6 Feb 92 13:59:58 PST From: kupfer (Mike Kupfer) Subject: setuid bit for ds3100 "at" Does anyone know why /sprite/cmds.ds3100/at wasn't setuid? mike Log-Number: 32086 Date: Fri, 7 Feb 92 11:37:43 PST From: shirriff (Ken Shirriff) Subject: Allspice wedged When I got in today, my machine was sort of wedged up, and X wouldn't bring up windows properly. Rup and la just hung. I looked at allspice and it seemed to be wedged up too. I did a L1-i, which not surprisingly killed it, so I rebooted. Ken Log-Number: 32096 Date: Mon, 10 Feb 92 15:43:28 PST From: shirriff (Ken Shirriff) Subject: ftp logging problem fixed The ftp logging problem seemed to be a disk full problem of some sort. df said: Prefix Server KBytes Used Avail % Used / allspice 495968 420434 25937 94% but when ftpd tried to write back the ftpdlog, it got disk full messages. I freed up some space on / and now the logging seems to work. 
Ken Log-Number: 32118 Date: Thu, 13 Feb 92 23:35:14 PST From: shirriff (Ken Shirriff) Subject: Bogus out of disk space messages Every 5 seconds, piracy prints: 2/13/92 23:31:49 allspice (14) RmtFile "/sprite/admin/migd/global-log" <10,9620> Write-back failed: out of disk space<40008> I've cleared out megabytes of disk space: <ks piracy 2:6> df /sprite/admin/migd/global-log Prefix Server KBytes Used Avail % Used / allspice 495968 396125 50246 88% and the file isn't very big: <ks piracy 2:5> ls -l /sprite/admin/migd/global-log -rw-rw-r-- 1 kupfer 835181 Feb 13 22:28 /sprite/admin/migd/global-log So why does piracy keep pestering me??? Ken Log-Number: 32098 Date: Tue, 11 Feb 92 10:20:49 PST From: culler (David Culler) Subject: xdvi doesn't work on ds3100 if dvi has multiple pages latex hw This is Common TeX, Version 2.9 (no format preloaded) (./hw.tex LaTeX Version 2.09 - Released 27 October 1986 (/lib/tex/article.sty Document Style `article'. Released 4 September 1986. (/lib/tex/art11.sty)) (./cs267.sty) (./hw.aux) [1] [2] (./hw.aux) Output written on hw.dvi (2 pages, 4856 bytes). Transcript written on hw.log. cardamom:handouts> xdvi hw & [2] 4274d cardamom:handouts> xdvi: DVI file corrupted [2] Exit 1 xdvi hw Log-Number: 32108 Date: Wed, 12 Feb 92 16:44:40 PST From: eklee (Edward K. Lee) Subject: Re: xdvi/latex problems I remade the tex formats taking care that it got all its input files from /lib/tex but this did not fix the xdvi problem. I tried the following combinations: latex on sprite with sprite latex latex on sprite with sww latex latex on sunos with sww latex In all cases, the resulting dvi file could be previewed with the sww xdvi but not with the sprite xdvi. Similar xdvi problem occur on sun4's. Xdvi gets hung-up while opening the file /sprite/lib/fonts/pk/cmtt10.1642pxl in the routine sleepx. 
0 sleepx(0x460f10, 0x7ddff3e0, 0x41a510, 0x100320e4, 0x45a40c) [0x460d28] 1 open.open(0x0, 0x45ae90, 0x0, 0x0, 0x0) ["open.c":93, 0x4615c8] 2 fopen.fopen(0x100163f4, 0x66a, 0x66a, 0x0, 0x0) ["fopen.c":64, 0x45a408] 3 formatted_open(path = 0x30, font = 0x10035c68 = "cmtt10", pxl = 0x10016404 = "pxl", mag = 1642, name = 0x10046620, count = 2) ["pxl_open.c":448, 0x405730] I think the problem may have to do with the tex fonts on sprite. Anyways, the sww version of xdvi does work under sprite. Ed Log-Number: 32109 Subject: how frequent are LFS checkpoints? Date: Wed, 12 Feb 92 17:31:14 PST From: Mike Kupfer <kupfer> When allspice was rebooted, I found that my freshly created /sprite/src/kernel/kupfer/proc subtree had vanished. I think I'd been working in it for at least 5 minutes, so I was a bit surprised to see the entire tree gone. mike Log-Number: 32121 Subject: Re: how frequent are LFS checkpoints? Date: Fri, 14 Feb 92 16:14:11 -0800 From: mendel@leland.Stanford.EDU > > When allspice was rebooted, I found that my freshly created > /sprite/src/kernel/kupfer/proc subtree had vanished. I think I'd been > working in it for at least 5 minutes, so I was a bit surprised to see > the entire tree gone. > > mike The checkpoint interval on the sprite/src/kernel is set to 60 seconds. It is surprising that a 5 minute old directory disappeared. It occurs to me that the checkpoint stuff use the Proc_CallFunc mechanism so if the call back timer stops so do the checkpoints. Did the machine die because of the call back queue messup? Mendel Log-Number: 32110 Date: Wed, 12 Feb 92 16:55:11 PST From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice crash Allspice crashed with "list pointers are invalid". The core dump failed because I ran out of disk space. The problem may have been due to me testing my prefix changes a while ago. Ken Log-Number: 32123 Subject: forgery out of memory? 
Date: Fri, 14 Feb 92 17:53:30 PST
From: Mike Kupfer <kupfer>

Forgery is complaining about being out of memory. "vmstat -t 5" shows something like

AVAIL  FREE  USER   KMEM  KSTK  FS$    PF-NUM   PF-SWP  PF-FS  POUTS
32768  224   12804  5048  1704  10332  1653576  1876    20336  5037
32768  208   12820  5048  1704  10332  33       0       1      0
32768  208   12820  5048  1704  10332  24       0       0      0
32768  208   12820  5048  1704  10332  24       0       0      0
32768  208   12820  5048  1704  10332  55       0       0      0
32768  208   12820  5048  1704  10332  25       0       0      0
32768  208   12820  5048  1704  10332  25       0       0      0
32768  208   12820  5048  1704  10332  24       0       0      0

About half the user pages are dirty. I don't know why they aren't getting written back, and I don't know why the FS cache is staying the same size.

I walked down to 508-5, but there wasn't anything elucidating in the syslog. I tried "vmcmd -a 1 -A 0", and that didn't help. I tried "fscmd -f", and that didn't seem to do anything, either.

(There is a foreign csim running, but it seems to be occupying very little memory. Most of the memory is taken by the X server and various instances of emacs, mx, and tx.)

A related gripe: it appears that it is easy to set VM and FS parameters, but it's not so easy to find out what their current values are, especially if you aren't sitting at the console.

mike

[5-Mar-1992: the problem was that forgery was out of VM segments, not out of memory. -mdk]

Log-Number: 32125
Date: Sat, 15 Feb 92 21:32:46 PST
From: mani (Mani Varadarajan)
Subject: finger not reporting properly

i fingered myself while logged into cardamom, and it doesn't show me as being logged into a sprite machine.

i'm not sure if this is exactly relevant to this problem, but migd wasn't running on cardamom. shouldn't it get automatically restarted?

mani

Log-Number: 32143
Subject: Re: finger not reporting properly
Date: Fri, 21 Feb 92 19:28:10 PST
From: Mike Kupfer <kupfer>

Oops, I got confused about what you were saying was broken and ended up testing the wrong thing.
Anyway, the answer is that the problem was that migd on cardamom was dead. We don't currently have anything set up to restart dead migd's. It's normally not a problem.

mike

Log-Number: 32128
Date: Tue, 18 Feb 92 13:16:59 PST
From: pmchen (Peter M. Chen)
Subject: file system deadlock

I'm running the PMCHEN kernel on mustard (ds5000). This is the same as the 1.109 kernel, but with the raid device driver added in. I mounted 3 disks running a raid 0. My program issues multiple I/O's, using multiple (3 in this case) processes to a set of files. All files are accessed by all processes.

I've gotten several deadlocks, where all processes hang in the WAIT state. This is nondeterministic, of course, as with most deadlocks. But it does seem to happen about once a day if I pound on the system hard enough. Are there any known deadlock problems? I haven't tried this with a non-raid file system, but I will in a bit.

I noticed that the processes tend to hang when the working set is much larger than the file cache. So it might have something to do with locks at the device level, not at the file cache level. Or, it may be just that the file cache lock is held longer, since it's going to disk frequently.

Pete

[5-Mar-1992: this looks like the problems we had last year with raid1 apparently losing I/O responses and hanging. -mdk]

Log-Number: 32130
Date: Wed, 19 Feb 1992 02:11:05 -0800
From: "Dean R. E. Long" <dlong@cse.ucsc.edu>
Subject: sendmail fix

Besides the accept() system call in getrequests(), the socket() system call [actually Fs_Open on /hosts/.../netTCP] can also be interrupted by a signal, causing sendmail to exit. This can happen when you invoke sendmail with the -q option on the server. The child process that runs through the queue can exit during socket() in getrequests() if the queue is empty, causing a SIGCHLD.

Wrapping the socket() call in a do {} while (fd==-1 && errno==EINTR) fixes the problem. Also, rebuilding all the object files might be a good idea.
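The retry loop suggested above can be sketched roughly as follows. This is only an illustration, assuming a POSIX-style environment in which an interrupted call fails with errno set to EINTR; the helper name socket_retry is hypothetical and does not come from the sendmail or Sprite sources.

```c
#include <errno.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not from sendmail): create a socket, retrying
 * for as long as the call fails only because a signal such as SIGCHLD
 * interrupted it.  Any other failure is returned to the caller as -1
 * with errno left untouched.
 */
static int socket_retry(int domain, int type, int protocol)
{
    int fd;

    do {
        fd = socket(domain, type, protocol);
    } while (fd == -1 && errno == EINTR);

    return fd;
}
```

A caller would use socket_retry(AF_INET, SOCK_STREAM, 0) where it previously called socket() directly, and would still have to handle a -1 return for genuine errors.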
While debugging the SIGCHLD problem, sendmail got stuck in the wait() call of reapchild(). I rebuilt conf.o, which seemed to fix it. Due to an #ifdef, reapchild() used wait3() after the recompile. Was wait3() recently added to the libc?

dl

Log-Number: 32145
Subject: Re: sendmail fix
Date: Sun, 23 Feb 92 17:11:57 PST
From: Mike Kupfer <kupfer>

> Wrapping the socket() call in a do {} while (fd==-1 && errno==EINTR)
> fixes the problem.

It would probably be better to fix socket() (in all its assorted incarnations, including the binary compatibility routine(s)) so that it retries if there was a signal. I don't think vanilla UNIX socket() can get interrupted by a signal, so most UNIX programs won't retry.

Dunno why sendmail used wait() before but uses wait3() now.

mike

Log-Number: 32146
From: Fred Douglis <douglis@MITL.COM>
Subject: Re: sprite arp, mop problems continue
Date: Sun, 23 Feb 92 20:15:39 -0500

>>>>> On Fri, 21 Feb 92 15:44:19 PST, shirriff@sprite.Berkeley.EDU
>>>>> (Ken Shirriff) said:

Ken> Maybe packets are getting lost or delayed in the network?
Ken> You could try using tcpdump to watch the packets going back
Ken> and forth to see what's happening or you could add some
Ken> debugging statements to mopd.

Aargh! It seems that timeouts were accounting for the problems with both mopd and reverse arp. I had already tried tcpdump but it wasn't clear what might be happening. However, I did look more closely at the code for mopd, and found that I could just invoke it by hand with "-d -d" to enable all its debugging. Seems it was getting a timeout, which it did not recover from whatsoever. So I just increased the "alarm(2)" to "alarm(5)" and voila, I could download a boot kernel the next time I tried and every time since.

I next tried to figure out what was going on with reverse arp. Seems that the code that matches an ethernet address to a sprite ID does one break too few, which means that the for loop goes all the way up to netNumHosts even after a match.
Maybe my timers are messed up, or maybe my ds5000s run slowly, or something, but it seems that by the time the server was replying the client had already given up. Perhaps it was dumb luck that once in a while it would not time out yet, or perhaps request #N would get a response from request #N-1. Who knows. However, by changing the constant in the arp code so it times out after 900ms instead of 500 ms (just replacing one arbitrary number with another :-), I was able to boot my sprite client for the first time in a couple of days.

My guess at this point is that Ken hit the nail on the head, and our network is somehow more loaded than yours is. Those timeout constants are certainly black magic and not very forgiving! And the idea that the kernel does 4 reverse arps, gets 4 timeouts, and just keeps going without its spriteID set is just plain silly. Rather than waiting forever for the server of "/" it should keep trying to get its spriteID before going on.

Fred

Log-Number: 32148
Date: Mon, 24 Feb 92 21:18:51 PST
From: shirriff@ginger.Berkeley.EDU (Ken Shirriff)
Subject: Allspice crashed: LFS problem

Allspice crashed running the MB.650 kernel. A core dump is in vmcore.mary. The symptoms were a whole bunch of

Lfs_StartWriteBack called, fileFschecked is 0

followed by

Lfs_SetSegUsage: Warning activeBytes for segment 1119 is -3045
Fatal Error: LfsSetSegUsage called for a clean segment

A couple minutes before the crash, I restarted the IP server on allspice because the ftp daemon wasn't working, but I don't know if this is related.

Ken

Log-Number: 32150
From: mgbaker (Mary Gray Baker)
Subject: xwaisq problem/question
Date: Mon, 24 Feb 92 22:53:04 PST

Xwaisq is very useful, except that I'm having trouble figuring out how to find the message numbers for the sprite log. They don't seem to be anywhere on the display.
Mary

Log-Number: 32151
Date: Mon, 24 Feb 92 23:00:11 PST
From: shirriff (Ken Shirriff)
Subject: Re: xwaisq problem/question

I think the numbers aren't visible because of the structure of wais. Basically you are a client, making requests to the wais server. The message numbers are the filenames, which are normally only relevant to the server (since you can't access the server's files directly), so they don't get passed along.

I guess fixing this would require changing the wais server so it sticks the Sprite message ID into the returned document somewhere.

Ken

Log-Number: 32156
Subject: Re: writing a large directory hierarchy under LFS
Date: Tue, 25 Feb 92 16:06:17 -0800
From: mendel@Niihau.Stanford.EDU.Stanford.EDU

> I've had trouble with sprite crashing when I try to run "update" to
> copy lots of files. The problems manifest themselves in various ways:
> locked cache blocks that wedge the system, garbaged data structures,
> etc.
>
> If you took a large LFS disk /a and updated it to an empty LFS disk
> /b, via "update /a/. /b/a", would it finish?
>
> This has been true with a vanilla kernel (e.g. JHH.2362, 1/9/92) so
> it's presumably nothing I'm doing. But maybe I'm missing something.
> If it's a "known problem" and the answer is simply "never write too
> many LFS files at a time", then I'll shrug it off. If it's something
> unexpected, then I guess someone should file this away, um, I mean fix
> it. (semi :-)

How much memory do you have on the machine? It's possible that you are running into problems that don't happen on the Berkeley Sprite file servers because of the large amount of memory.

>
> One more LFS question: if I run chmod, then sync, then this update,
> and I crash, the chmod doesn't take effect. Does sync not force a
> checkpoint, and if not, is there a user-level command to force one?
>
> Fred

Yes, sync causes an LFS checkpoint to occur.
Mendel Log-Number: 32160 From: Fred Douglis <douglis@MITL.COM> Subject: rarp problem located: Net_AddrToId bug Date: Thu, 27 Feb 92 15:25:40 -0500 I noticed that even though the mopd problem I reported earlier seemed to have something to do with timeouts and dropped packets, the fact that a diskless client couldn't boot was not explained. I had stopped worrying about it, since I had a local disk and could set the spriteID from the disk header, but then I ran into a "network sniffer" salesman and tried to use that to explain the strange goings-on. It didn't, but on further inspection, I figured it out: Net_AddrToId not only was missing a break, as I said before, but it was also doing a compare on the entire 8-byte Net_Address structure rather than just the 6 bytes in the ethernet address. Perhaps in your environment those other two bytes always match, but that was not the case here. Changing the test to look at the network type appears to fix the problem. Fred Log-Number: 32162 Subject: incorrect default handling for some UNIX signals Date: Thu, 27 Feb 92 14:48:04 PST From: Mike Kupfer <kupfer> By default, UNIX signals that don't map to regular Sprite signals are ignored. However, except for SIGWINCH, the UNIX default for all of those signals should be the termination of the process. In the case of SIGUSR1, this incorrect behavior breaks "find". (If you say find . -exec NoSuchProg {} \; then "find" should stop the first time it tries to invoke the non-existent program. Under the current setup, it just plows along, generating a slew of error messages.) The signals that are broken by this bug are SIGIOT SIGEMT SIGSYS SIGXCPU SIGXFSZ SIGVTALRM SIGPROF SIGUSR1 SIGUSR2 mike P.S. There are two parts to this bug. The first part is in Sig_Init, which defaults the action for non-zero signal numbers to be SIG_IGNORE_ACTION. The second part is in Sig_SendProc, which ignores signal 0 (SIGSYS, SIGXCPU, SIGXFSZ, SIGVTALRM, and SIGPROF). 
Log-Number: 32164 Subject: lust crash: /pcs hardware error Date: Thu, 27 Feb 92 21:20:23 PST From: Mike Kupfer <kupfer> Lust crashed with Warning: SCSI Disk SCSI#0 Target 4 LUN 0 error: media error - info bytes 0x0 0x0 0xd6 0xef Fatal Error: LfsError: on /pcs status 0x70008, Can't write segment to log I checked the cabling on the disk and rebooted. mike Log-Number: 32168 Date: Fri, 28 Feb 92 15:29:48 PST From: shirriff@ginger.Berkeley.EDU (Ken Shirriff) Subject: Allspice media error crash Allspice crashed after getting a media error on /sprite/src/kernel. I put a core dump in vmcore.media. The console said: SCSI Disk #3 target 2 LUN 0 media error 0x0 0x2b 0x3d 0x27 LfsError on /sprite/src/kernel: Can't write segment to log. Ken Log-Number: 32169 Date: Fri, 28 Feb 92 19:26:40 PST From: mottsmth (Jim Mott-Smith) Subject: 'Permission denied' message from update. Often, when I update my files to ginger, update gives me a 'permission denied' message for no apparent reason. In this example, the first two attempts failed but the third one's a charm... > sabotage:~mottsmth/j/jaq> update indx.c /home/ginger/users/mottsmth/j/jaq/indx.c > Updating: /home/ginger/users/mottsmth/j/jaq/indx.c > Couldn't rename "/home/ginger/users/mottsmth/j/jaq/indx.c" to "/home/ginger/users/mottsmth/j/jaq/indx.cXXX": permission denied. > sabotage:~mottsmth/j/jaq> update indx.c /home/ginger/users/mottsmth/j/jaq/indx.c > Updating: /home/ginger/users/mottsmth/j/jaq/indx.c > Couldn't rename "/home/ginger/users/mottsmth/j/jaq/indx.c" to "/home/ginger/users/mottsmth/j/jaq/indx.cXXX": permission denied. > sabotage:~mottsmth/j/jaq> !! > update indx.c /home/ginger/users/mottsmth/j/jaq/indx.c > Updating: /home/ginger/users/mottsmth/j/jaq/indx.c > Anybody know what's going on? -- Jim M-S Log-Number: 32170 Date: Sat, 29 Feb 92 18:19:24 PST From: shirriff (Ken Shirriff) Subject: gcc compiler chokes on INT_MIN If you assign INT_MIN to a double, it comes out as INT_MAX. 
This is very bad if a program initializes a value to INT_MIN and then loops over values to find the maximum. For example:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        double x;
        int y;

        x = INT_MIN;            /* should convert to -2147483648.0 */
        if (x > 0) {            /* true only if the conversion is wrong */
            printf("%f\n", x);
        }
        return 0;
    }

Output on a sun4 is:

    2147483647.999985

Ken

Log-Number: 32172
From: mgbaker (Mary Gray Baker)
Subject: Server failures, reincarnation, alpha particles and leap year
Date: Sun, 01 Mar 92 00:13:59 PST

Okay. I've tried to respond in a sensible sort of way to people who insist that things like leap day will cause massive earthquakes, stock market failures, or the disappearance of entire oceans. The sensible response is "nonsense."

But here's what happened today, Feb. 29th. We have had network failures, disk failures, console display failures, and group simultaneous memory errors. Not to mention what happened to Jim this morning, but that's another story.

Lust crashed this evening with a media error on /pcs. It got a short read error in LfsReadBytes. I opted for bringing up the machine without /pcs mounted. I know this causes recovery to fail, but I couldn't see how badly trashed things were, because lust's console chose to fail at that point. I got to see the LfsReadBytes message but not much else before the console went into this diagonal ziggly mode. I got the console back after turning it off for a bit and brought up the machine without /pcs. This way whatever data is there in the last good checkpoint won't be overwritten or cleaned or something before we can deal with it.

Then I went over to allspice to figure out why it was spewing stuff out on its console so fast that the machine had practically come to a halt. It seemed to be the talkd process. I couldn't do anything from allspice, so I went back to lust's console to fix things remotely. At that point, lust got an ECC error.
It said:

ECC read error during CPU access of address 0x191578
ECC error was in the low bank single bit error syndrome bits = 0x10 check bits = 0x1c00

What surprised me is that it chose that moment to time out with allspice. So I went back over to allspice, and lo and behold, it had suffered a memory error at the same time! It got a random cache flush error. Then it got a watchdog reset. So I brought up allspice.

Then I bumped into Mike Olsen, and today he lost his disk with a large part of his Master's project on it, and he can't get at any of the software he was using in Cory because the gateway kerplunked itself.

Maybe that charming cellist on Telegraph Ave. who tried to convince me that I should buy his special incense that wards off reincarnations of unpleasant spirits actually knew what he was talking about.

Mary

Log-Number: 32175
Date: Mon, 2 Mar 92 11:19:02 PST
From: shirriff (Ken Shirriff)
Subject: SWW X11 doesn't start up

David Culler couldn't start up X this morning. He would get a gray background and then messages about waiting for the X server to respond. He was running X out of the software warehouse. I changed his path to use /X11/R4/cmds instead of sww and then X worked for him.

Ken

Log-Number: 32181
Subject: Re: problems with mail queue
Date: Tue, 03 Mar 92 15:21:33 PST
From: Mike Kupfer <kupfer>

Probably *the* most common problem with the Sprite mail queue is that sendmail will lock something in the queue and then die, leaving the lock file around. (The "*" next to the mailq QID means that the message is locked.) A later "sendmail -q" does nothing because the lock file is theoretically an indication that some other sendmail process is working on that message.
You can (1) wait for a nightly script to delete the lock file (this actually takes a couple days) (2) send mail to bugs, asking for someone to remove bogus lock files (3) remove the bogus lock files yourself (but only if you're sure there isn't currently a sendmail working on the message) mike Log-Number: 32186 Date: Wed, 4 Mar 92 08:10:06 PST From: ouster (John Ousterhout) Subject: Lust crash Lust crashed this morning with a SCSI error on /tmp. The message was something like "unsupported class7 error 0xb". -John- Log-Number: 32189 Date: Wed, 4 Mar 92 08:38:14 PST From: ouster (John Ousterhout) Subject: Third lust crash This time the message was: Fscache_UpdateAttFromClient 75: "(no name)" <2,1009> short size 62684 not 67840 By the way, I've commented out the mount line for /tmp in lust's bootcmds file; this will need to be undone when /tmp is eventually moved back to lust. -John- Log-Number: 32190 Subject: SWW epoch seg faults when asking for version Date: Wed, 04 Mar 92 11:27:43 PST From: Mike Kupfer <kupfer> I can start up /usr/sww/bin/epoch on sage, but if I ask for its version (ESC-x emacs-version <CR>), it segment faults. It works fine on shallot, so I assume it's a binary compatibility problem. Hopefully it's just a matter of not recognizing that it's a UNIX binary, but that's only a guess. mike Log-Number: 32192 Date: Wed, 4 Mar 92 19:11:46 PST From: kupfer@ginger.Berkeley.EDU (Mike Kupfer) Subject: lust crash: unsupported class7 error 0xb Lust died with what appears to be a hardware error on /user5. There were a bunch of error messages Warning: short async NFS write which I assume are irrelevant (but I'm not sure). In the middle of one of these messages there was Warning: SCSI Disk SCSI#0 Target 6 LUN 0 error: unsupported class7 error 0xb Fatal Error: LfsError: on /user5 status 0x50003, Can't write segment to log. When I tried to reboot lust, I got Bad segment summary magic in segment 1743 on /user5. 
I am currently using the programs in ~mendel/lfs/src/cmds to try to repair /user5. It would be nice if these programs (esp. lfscheck, lfsrebuild, and lfsrecov) were installed in /sprite/admin, had man pages, etc. mike Log-Number: 32194 Subject: Re: lust crash: unsupported class7 error 0xb Date: Wed, 04 Mar 92 20:52:45 PST From: Mike Kupfer <kupfer> If I'm reading the code and documentation right, this error is "aborted command". Indicates that the target aborted the command. The initiator may be able to recover by trying the command again. mike Log-Number: 32195 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 4 Mar 1992 21:14:03 PST Subject: Re: /pcs offline again The lfs* commands in Mendels directory are of unknown reliability. I'm not sure how well tested they are. If Mendel can confirm that they are reasonably reliable then we should install them in /sprite/admin. Note, however, that none of them are an lfs equivalent of fscheck, i.e. it will repair problems in the file system and move unreachable files to /lost+found. John Log-Number: 32197 Subject: lfsrebuild bugs Date: Wed, 04 Mar 92 22:28:46 PST From: Mike Kupfer <kupfer> (1) the help message gives the same explanation for both "-verbose" and "-oldcp". I assume that the explanation for "oldcp" should be something like "use the old checkpoint". (2) When lfsrebuild deletes a file, it doesn't do it cleanly. Here are a couple messages from when it fixed up /user5: File kupfer/cmds.sprited.ds3100/psh references non-allocated descriptor 47783. File Deleted. Entry psh (2) now has nameLength 3, recordLength 12, fileNumber 0. File kupfer/cmds.sprited.ds3100/ln references non-allocated descriptor 47769. File Deleted. Entry ln (5) now has nameLength 2, recordLength 12, fileNumber 0. Unfortunately, the entry for the file is apparently left in the directory. 
This leads to things like sage% ls cmds.sprited.ds3100 cmds.sprited.ds3100/psh not found cmds.sprited.ds3100/ln not found chmod* kill* mv* rmdir* tail* cp* loop* ps* setjmp* world* fault* ls* pwd* suicide* find* mkdir* rm* sync* and find: bad status < ./kupfer/cmds.sprited.ds3100/psh > find: bad status < ./kupfer/cmds.sprited.ds3100/ln > Additional note: Assuming that the lfs repair programs don't ask for user input, it's probably a good idea to run them piped into tee, so that you can get a log of changes and errors. mike Log-Number: 32198 Subject: lust crash: address fault Date: Wed, 04 Mar 92 23:19:25 PST From: Mike Kupfer <kupfer> Lust died again, and for once it didn't have anything to do with LFS. Maybe. The message on the console was MachKernelExceptionHandler: Address error on load: addr: 7 PC: 800825a8 JHH tried to debug it, but his workstation ended up getting hung. According to gdb, the instruction at that PC is 0x800825a8 <Fslcl_DeleteFileDesc+988>: lw $t4,8($a2) According to "dis", the PC is line 2247 of fslclLookup.c (well, dis says 2245, but examining the surrounding code leads me to think 2247): if (uid == descPtr->uid) { My guess is that descPtr was NIL. I wonder if this has anything to do with the bogus directory entries that lfsrebuild left lying around. mike [5-March-1992: we also got the same crash this afternoon, right before the Sprite meeting. -mdk] Log-Number: 32199 Date: Thu, 5 Mar 92 08:24:55 PST From: ouster (John Ousterhout) Subject: Allspice-Lust lock-up Allspice and Lust embraced each other in deadly fashion this morning: Allspice was printing the message "Client 1 dropped 30 return-atts requests for "spritehosts"" every few seconds and Lust appeared to be waiting for responses from Allspice (there were RPC timeout messages on its console). Bob and I rebooted both machines. -John- Log-Number: 32201 From: jhh@sprite.Berkeley.EDU (John H.
Hartman) Date: Thu, 5 Mar 1992 10:26:23 PST Subject: /user5 screwed up Looks like lfsrebuild had some problems fixing /user5, in addition to the ones Mike reported yesterday. The directory /lost+found/1932/59 also appears as /user5/rab/src/cmds/ee/sun4.md, and its ".." entry points to /user5/rab/src/cmds/ee. This confuses du. A "find . -print" in /user5 gives lots of bad status messages, as if a bunch of directories contain entries for non-existent files. I propose a multi-step solution: 1) change the kernel so that aborted operations are retried. 2) dump /user5 to tape 3) make a new file system 4) restore /user5 from tape 5) fix lfsrebuild John Log-Number: 32202 Subject: xwais doesn't use standard text widget? Date: Thu, 05 Mar 92 17:08:49 PST From: Mike Kupfer <kupfer> The text windows inside xwais act like regular Xtk text widgets, except they don't seem to pay attention to the resource definitions. In particular, I have a few keys rebound for text widgets (^W and meta-W), and the redefinition works for everything except xwais. mike Log-Number: 32203 Date: Thu, 5 Mar 92 17:11:28 PST From: mani (Mani Varadarajan) Subject: X crashed while compiling a program here, X all of a sudden died with the message "PdevWrite: signal 14". In addition, "Fs_PageCopy: Copy failed <40008>" appeared. Then there was a whole slew of inetd: select: I/O error messages, topped off by an "Exiting: Too many select errors". that's all the information i have, unfortunately. mani Log-Number: 32204 Subject: tar.gnu loses if verbose with really long names Date: Thu, 05 Mar 92 21:58:41 PST From: Mike Kupfer <kupfer> Posix raised the maximum filename length that tar can handle, but it still places an upper bound that is less than MAXPATHLEN. So Bob added a feature to the GNU tar where if it finds a file whose name is too long, it generates a (short) unique name for the file and uses that name instead. Unfortunately, this code is somewhat buggy. 
If you turn on "verbose" (e.g., "tar cfv foo.tar ..."), tar croaks when it tries to process a file with a too-long name. The upshot is that if you use the -v ("verbose") option to restore, you could end up putting tar.gnu into the debugger. The workaround is to not specify -v. mike Log-Number: 32207 Date: Thu, 5 Mar 92 22:47:38 PST From: shirriff (Ken Shirriff) Subject: Re: gcc compiler chokes on INT_MIN I checked and the problem is with the old version of gcc; sscanf handles INT_MIN perfectly, so no aspersions should be cast on the writer of sscanf. I compiled a program with the sww gcc (version 1.40) and linked in the sprite library and it worked fine. I compiled it with our gcc (version 1.37.1) and the assignment of INT_MIN failed. In both cases, sscanf read INT_MIN correctly. Thus, I conclude the problem is in gcc 1.37.1. Ken Log-Number: 32209 Date: Fri, 6 Mar 92 08:11:14 PST From: ouster (John Ousterhout) Subject: Re: gcc compiler chokes on INT_MIN In response to Ken's message: I checked and the problem is with the old version of gcc; sscanf handles INT_MIN perfectly, so no aspersions should be cast on the writer of sscanf. I compiled a program with the sww gcc (version 1.40) and linked in the sprite library and it worked fine. I compiled it with our gcc (version 1.37.1) and the assignment of INT_MIN failed. In both cases, sscanf read INT_MIN correctly. Thus, I conclude the problem is in gcc 1.37.1. Perhaps I'm mis-understanding what Ken did, but it seems to me that there could still be a problem with sscanf. The potentially-bad sscanf stuff occurs in the compiler, when it reads the INT_MIN characters and parses that into a number. Thus, compiling with sww gcc and linking the program with the Sprite library might not find the problem, since the bogus parsing occurs in the compiler itself, not in the resulting program. 
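The run-time half of this question can be checked in isolation. The following is a minimal sketch and is not taken from the thread; the function name SscanfReadsIntMin is hypothetical. It prints INT_MIN with sprintf, reads the text back with sscanf, and compares the two. It says nothing about how the compiler itself parses the constant, which is the case raised above.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the Sprite sources).
 *
 * Return 1 if the C library's sscanf reads the decimal text of INT_MIN
 * back as INT_MIN, and 0 otherwise.  The text is produced with sprintf
 * so that no particular width of int is assumed.
 */
int
SscanfReadsIntMin(void)
{
    char text[64];
    int value = 0;
    int length;

    length = sprintf(text, "%d", INT_MIN);
    assert(length > 0 && (size_t) length < sizeof(text));

    if (sscanf(text, "%d", &value) != 1) {
        return 0;       /* sscanf did not convert anything */
    }
    return value == INT_MIN;
}
```

Calling SscanfReadsIntMin() from a small main and printing the result exercises only the library linked into that program, so it would have to be built into the compiler's own environment to test the compile-time path.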
-John- Log-Number: 32211 Date: Fri, 6 Mar 92 11:49:47 PST From: shirriff (Ken Shirriff) Subject: Re: gcc compiler chokes on INT_MIN One of the things I did in my tests was to do sscanf on the INT_MIN string, using both %d and %f. Sscanf returned the proper value in both cases, so I conclude that sscanf handles INT_MIN properly. Ken Log-Number: 32212 Date: Fri, 6 Mar 92 22:24:21 PST From: kupfer (Mike Kupfer) Subject: LFS can consume all of memory when you move a disk We kept crashing hijack when we tried to attach /user5, which had previously been hooked up to lust. The problem is apparently that LFS records in the superblock an estimate of how much memory (number of blocks?) to use for cleaning. Lust has 128MB, hijack has 32MB, so when we tried to attach /user5, hijack would run out of memory. LFS should probably do a sanity check on the number from the superblock (take a look at how much memory is available on the machine) and do the Right Thing. mike Log-Number: 32215 Subject: glitches from /user5 restore Date: Sat, 07 Mar 92 23:24:48 PST From: Mike Kupfer <kupfer> Dear /user5 user: You may notice a couple of glitches from the recent work on /user5. First, some files (or directories) that you had deleted or renamed might have reappeared. Unfortunately, this problem is unavoidable, and you'll just have to re-remove or rename the suckers as you find them. Second, it appears that any files that had multiple hard links only got one link restored. xmh users will get bit by this because the xmh "copy" command actually just makes a link, rather than copying the message. This is apparently a bug in our backup/restore suite. I am debating hacking up a script to find my missing hard links. Let me know if you're interested in getting a list of missing links for your directory. 
mike Log-Number: 32216 Subject: lfsrebuild doesn't fix bad summary magic number Date: Sun, 08 Mar 92 18:08:47 PST From: Mike Kupfer <kupfer> I naively thought that because lfsrebuild complained about the bad magic number (in a segment summary block) on /user5, it would fix it. Mendel tells me that this is an incorrect assumption. I guess it just bails out or something. mike Log-Number: 32218 Subject: missing fonts for xproof on R5? Date: Sun, 08 Mar 92 20:21:25 PST From: Mike Kupfer <kupfer> I tried running /X11/R4/cmds/xproof on the output from "troff -ms", and xproof went into an infinite loop complaining about not being able to find -adobe-times-medium-r-*--*-100-75-75-mumble-adobe-fontspecific. When I tore down my R5 server (on sage) and restarted using R4, xproof worked fine. mike Log-Number: 32222 Date: Tue, 10 Mar 92 01:31:42 PST From: dlong (Dean Long) Subject: find patch for descending into mounted filesystems Right now find behaves the same no matter if -xdev is specified or not. Here is a quick patch that allows find to descend into filesystems. Specifying -xdev will still disable this feature. dl
---------------------------------------------------------------------
*** /tmp/,RCSt1852521 Thu Jan 1 01:20:42 1970
--- find.c Thu Jan 1 01:20:36 1970
***************
*** 656,661 ****
--- 656,664 ----
      fprintf(stderr, "find: bad status < %s >\n", name);
      return(0);
  }
+ if ((Statb.st_mode & S_IFMT) == S_IFRLNK) {
+     stat(fname, &Statb);
+ }
  (*exlist->F)(exlist);
  if((Statb.st_mode&S_IFMT)!=S_IFDIR || !Xdev && Devstat.st_dev != Statb.st_dev)

Log-Number: 32226 Subject: ~sprite/cmds.ds3100/lint on ginger Date: Wed, 11 Mar 92 13:04:57 PST From: Mike Kupfer <kupfer> There's a 1989 version of the lint front-end in ~sprite on ginger. I'm not sure what's so special about it, other than it automatically defines the "sprite" flag. Unfortunately, it's missing some features that the Ultrix 4.2 lint has, like automatically defining __LANGUAGE_C.
If we no longer do much linting on dill, I think we should just flush the version in ~sprite. Otherwise, we should update it to have the features from the Ultrix lint. mike Log-Number: 32228 Date: Wed, 11 Mar 92 14:21:21 PST From: dlong (Dean Long) Subject: Sprite on Sun IPX The sun4c kernel runs almost flawlessly on a Sun 4/50 (IPX), which is basically a Sparc 2 in an IPC box. The only problem I saw was running X. The code for the cg6 frame buffer wasn't quite right. Hopefully I'll have a patch for it pretty soon. dl Log-Number: 32229 Subject: valloc & source compatibility Date: Wed, 11 Mar 92 14:38:29 PST From: Mike Kupfer <kupfer> We don't provide valloc in the C library, which is occasionally a stumbling block when importing new sources. Can anyone think of a reason for not just making valloc be a facade over malloc? (I'm assuming that our malloc returns page-aligned objects for sizes of at least one page; the man page was unclear on this point.) mike Log-Number: 32230 Subject: signal mask back door Date: Wed, 11 Mar 92 16:35:07 PST From: Mike Kupfer <kupfer> A user signal handler can mung the signals hold mask that's stored in the Sig_Context. Sig_Return takes whatever's stored there, without checking it against sigCanHoldMask. I don't think this is much of a back door, but it should probably be fixed some day. Note that merely ANDing the saved context mask with sigCanHoldMask can do the Wrong Thing. For example, if the SIGSEGV handler causes a floating point signal, then when the floating point handler returns, the hold mask should still have the segv bit turned on. mike Log-Number: 32232 Date: Thu, 12 Mar 92 09:04:20 PST From: sullivan (Mark Sullivan) Subject: Ofs_FileDescInit fetched non-free file descriptor Is this a known bug? I rebooted this morning after a VmPageRead error on a swap file and now I cannot create files on /postdev. The cshell tells me that "file already exists" whenever I try to do a shell command that involves creation of a file (e.g. 
touch, redirected stdout). The error message in the subject line is all over piracy's monitor, so I assume this is what OFS thinks is the problem. Incidentally, df gives me some pretty weird answers. I thought the disk might be full, but:

Prefix     Server   KBytes   Used   Avail    % Used
/postdev   piracy   309808   -579   279406   0%

Mark Log-Number: 32233 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 12 Mar 1992 10:29:11 PST Subject: Re: Ofs_FileDescInit fetched non-free file descriptor Piracy needs to be rebooted. The entry for postdev was missing from the mount file, so fscheck wasn't being run on reboot. Running fscheck should take care of the problems (I hope). John Log-Number: 32234 Date: Thu, 12 Mar 92 11:41:20 -0800 From: dlong@cats.UCSC.EDU (Dean R. E. Long) Subject: Re: Ofs_FileDescInit fetched non-free file descriptor I also get the same error every once in a while after a crash, usually on the root filesystem. A quick fix seems to be:

1. mv <bad_dir> <tmp_dir>
2. mkdir <bad_dir>
3. mv <tmp_dir>/* <bad_dir>
4. reboot and recheck filesystem

Without steps 1-3, the future fscheck doesn't seem to fix the problem. My guess is there are still problems with fscheck running on / after it is mounted. dl Log-Number: 32238 Date: Thu, 12 Mar 92 23:53:37 PST From: voelker (Geoffrey M. Voelker) Subject: lust + /user6 /user6 filled up at around 11:30 tonight, and a few minutes later lust crashed (but they seem to be unrelated incidents). Lust's console read: Fatal Error: OutputPacket: packet too large (3686) Entering debugger with a Breakpoint trap exception at PC 0x800e786c I rebooted lust, and removed some stuff in my home directory on /user6. It has about 4 megs free as of this message... -geoff Log-Number: 32240 Date: Fri, 13 Mar 92 22:36:58 PST From: mottsmth (Jim Mott-Smith) Subject: Sabotage and Sage wedged up running 1.110 Sabotage and Sage completely wedged up twice tonight running the new kernel. Mike and I debugged it as far as we could.
Sage wedged up with everyone stuck waiting for a monitor lock held in Fsutil_WaitListNotify. The lock was held by a process waiting on an RPC to sabotage. Why the RPC didn't time out (since Sabotage was sick) is a mystery. The timer queue appeared normal. Sabotage's condition is also a mystery. Most of the waiting processes had event values = -1 so there was no obvious source of trouble. -- Jim M-S Log-Number: 32241 Date: Fri, 13 Mar 92 23:51:02 -0800 From: dlong@cats.UCSC.EDU (Dean R. E. Long) Subject: 1.110 kernel My 16M file server (don't laugh, the CPU board went out on the previous IPC file server for about the 5th time) ran out of memory before it got done booting with the 1.110 kernel. I had to go back to the 1.109 kernel. I think most of the memory is going to LFS reserving cache blocks for cleaning. I reset the maxNumCacheBlocks entry on all my LFS file systems to a lower number to save memory. Maybe the 1.110 is only treating it as a hint and using more memory than I want? dl Log-Number: 32243 Subject: kmsg and sun3's Date: Sun, 15 Mar 92 14:46:11 PST From: Mike Kupfer <kupfer> Has anyone been able to use kmsg with a sun3 recently? It seems like I always get a complaint about "Short read" when I try to kmsg a sun3 from a sun4. mike Log-Number: 32244 Date: Sun, 15 Mar 92 15:34:07 -0800 From: sullivan@postgres.Berkeley.EDU (Mark Sullivan) Subject: VmPageServerRead: non-existent swap file jhh tells me this is an LFS problem. The client machine's kernel panics when LFS munches a swap file. My kernel has tripped over this error several times in the last few weeks, but Jim has seen it on sabotage also. I know fixing LFS is going to be hard, but it would help me a lot if we could change the vm module to kill the offending process rather than panic. I have been running tests at night and I lose a lot of time if the kernel goes out. If anyone does fix this, please let me know since I will have to rebuild my kernel with the fix. 
Thanks, Mark Log-Number: 32262 Date: Sat, 21 Mar 92 10:50:47 PST From: sullivan (Mark Sullivan) Subject: VmPageServerRead non-existent swap file panic killed my kernel again last night. Also, I've been noticing ECC errors in the memory of arson and piracy. I usually only see them during reboot since I always log into the machines remotely and they only show up on the console. I've seen them on the console of arson also after programs died mysteriously. Mark Log-Number: 32264 Subject: Re: VmPageServerRead non-existent swap file Date: Sun, 22 Mar 92 00:02:05 PST From: Mike Kupfer <kupfer> I looked into changing VmPageServerRead to return VM_SWAP_ERROR instead of panicking. Almost all its callers either (a) nuke the process(es) that want the page or (b) mark the page as invalid (so presumably the process(es) will die with an address fault). There is one exception, though. VmSegCantCOW calls VmCOWCopySeg, which calls COR, which calls VmPageServerRead. Unfortunately, VmSegCantCOW ignores the return value from VmCOWCopySeg. On the other hand, it doesn't look like anyone actually uses VmSegCantCOW, so maybe I can just flush it? mike Log-Number: 32245 Date: Sun, 15 Mar 92 17:52:50 PST From: shirriff (Ken Shirriff) Subject: Runaway csh on lust Root had a runaway csh -i using 90% of the CPU. I tried to debug it, but dbx blew it away for some reason. Ken Log-Number: 32246 Date: Sun, 15 Mar 92 18:05:27 PST From: shirriff (Ken Shirriff) Subject: violence seems unreliable Violence crashed with an ethernet error and then I couldn't get it to reboot. I moved it to another desk and then it would reboot. I moved it downstairs and it wouldn't reboot. I power-cycled it and then it would reboot. I don't know if it is temperamental or has some real ethernet problem. Ken Log-Number: 32249 Subject: misleading error status if file name component too big?
Date: Mon, 16 Mar 92 18:42:37 PST From: Mike Kupfer <kupfer> Suppose you type "touch foo", where foo is a file name longer than FS_MAX_NAME_LENGTH (255) characters (which is the limit on a single component of a name). Fs_Open, which is called by creat() inside "touch", returns FS_FILE_NOT_FOUND. Shouldn't that be FS_INVALID_ARG? mike Log-Number: 32251 Date: Tue, 17 Mar 92 19:32:45 PST From: mani@villandry.berkeley.edu (Mani Varadarajan) Subject: Cory sprite hung when i came in this morning, king was in the debugger with the message: MachKernelExceptionHandler address error on load: addr: 100023 pc: 800eda08 Entering debugger with a TLB load address error exception at pc 0x800eda08. king is running the 1.109 kernel. i tried to attach to it to get a trace of what had happened, but i couldn't connect to it. mani Log-Number: 32254 Date: Thu, 19 Mar 92 02:22:16 PST From: gunter (Michial Gunter) Subject: Emacs crashes on ds5000's -- Illegal Instruction Actions speak louder than words: subversion[37]:~:2:11am:> emacs -r & [6] 15a7d subversion[38]:~:2:12am:> I type "ESC-x gnus" invoking the gnus news reader. It does some stuff then crashes - apparently during garbage collection. [6] + Illegal instruction emacs -r Notes: 1) On sun4s this doesn't seem to happen. 2) It crashed at a couple other times tonight. I would venture, then, that the problem area is not hit only by gnus. 3) I tried (for not too long) to reproduce this behavior when I run emacs with the -q (don't load init file) option. I wasn't able to. 4) This behavior is new --- probably new today. I'd be interested to hear what the problem is. thanks, mike Log-Number: 32258 Subject: Re: Emacs crashes on ds5000's -- Illegal Instruction Date: Thu, 19 Mar 92 15:44:30 PST From: Mike Kupfer <kupfer> Well, I looked at the dead Emacs on subversion and I tried again to duplicate the problem. 
What appears to be happening is that the garbage collector (the actual function is mark_object) finds an object with a bogus tag, so it panics. I su'd to you and couldn't duplicate the problem, even after sourcing your .login. I did an "su -" to you and the problem appeared. Further experimentation seems to point at something in ~kupfer/emacs/default.el that makes the problem go away (or at least postpones it). I'm afraid I don't have more time to look into this right now. I'll put it into my Todo list, but it's not likely I'll get to it any time soon. If you come up with more information to help pin down the bug, please send mail to bugs. thanks, mike Log-Number: 32257 Subject: Ultrix lint & cc understand void, prototypes Date: Thu, 19 Mar 92 11:59:01 PST From: Mike Kupfer <kupfer> The Ultrix lint installed on dill understands void pointers and function prototypes, as does the C compiler. Bob already installed the C compiler on Sprite. If nobody else gets to it before I do, I'll install the new lint, but don't hold your breath. :-) (The Ultrix cc doesn't seem to have an equivalent to -Wall, so it's probably a good idea to keep using lint.) Do we want to change <cfuncproto.h> to enable function prototypes and void pointers for DECstations? mike Log-Number: 32260 Subject: Re: printer problem Date: Fri, 20 Mar 92 10:21:45 PST From: Mike Kupfer <kupfer> I restarted the daemon and that unclogged the queue. Bob, this is something you could do. If lpq on Sprite claims to be waiting for shallot, but lpq on shallot looks normal ("no entries"), then try (on Sprite) 1. su'ing to root 2. type "lpc restart lw533", which kills off the old printer daemon and starts a new one. If that doesn't fix things, then it's time for mail to bugs. mike Log-Number: 32263 Subject: scvs: more paranoia and better signal handling Date: Sat, 21 Mar 92 21:48:16 PST From: Mike Kupfer <kupfer> I occasionally cd to /sprite/src/kernel and forget to cd to my subtree before doing "scvs co mumble". 
Is this ever legal, or must one always be in a private subdirectory? Even if it's legal, I think we should make it hard to do, because cvs will cheerfully go in and make any files the user owns user- and group-writable (this is in /sprite/src/kernel/module_name, mind you). Lord only knows what other mischief it causes. Also, once you discover that you just Screwed Up, it's hard to kill scvs off. You hit control-C (and again and again and again) and scvs just keeps chugging along. mike Log-Number: 32265 Date: Sun, 22 Mar 92 16:43:03 PST From: gunter (Michial Gunter) Subject: Emacs on Sprite The emacs on sprite does not seem to appropriately handle the signals sent from the keyboard (via C-c C-c or C-c C-z) to a program being debugged by gdb. I recall seeing that a later version of emacs had a fix I remember to have been concerned with this issue. If this is the case, I would be willing to help with the reinstallation of emacs on Sprite, if that is deemed appropriate/useful. thanks, mike Log-Number: 32266 Subject: Re: Emacs on Sprite Date: Sun, 22 Mar 92 18:43:00 PST From: Mike Kupfer <kupfer> Emacs on Sprite has a general problem sending signals to subshells (e.g., C-c C-c doesn't work in shell mode, either). I think this is a Sprite problem rather than an Emacs problem, because I run Emacs 18.54 on both Sprite and Mach, and Sprite has the problem whereas Mach doesn't. On the other hand, I didn't build Emacs for my Mach system, so maybe it has a patch in it. So, if you have some diffs that you think solve the problem, let me know, and I'll try applying them. 
If you are volunteering to install a new version of Emacs from scratch, you should first (a) read the Sprite Engineering Manual, so that you understand Sprite source organization (you can get a copy from 533 Evans), and (b) look at the RCS history for the currently installed Emacs (in /local/emacs/src/cmds/emacs), so that you get a feel for the changes that need to be made, e.g., to support Sprite pseudo-devices. If after doing these two things you are still interested in installing a new Emacs, please come talk to me, so that I can convince you to install Epoch at the same time. :-) mike Log-Number: 32267 Subject: names for UNIX-compat signals? Date: Tue, 24 Mar 92 21:26:39 PST From: Mike Kupfer <kupfer> There are a handful of UNIX signals for which Sprite doesn't have an equivalent signal. Among them are: SIGIOT, SIGEMT, SIGIO, SIGWINCH, SIGUSR1, and SIGUSR2. The code that converts between Sprite and UNIX signal numbers maps these UNIX signals into the integers 26 through 31 (the remaining ones are all mapped to 0). I assume that these numbers were never given names because of the long-standing plan to use UNIX signal numbers for everything. If this conversion is not likely to happen any time soon, though, I would like to assign names to these signal numbers. I need to use them inside the kernel (to get the default signal behavior right) and would much prefer to use a name than a bare integer. My plan is to use the following names:

UNIX       proposed Sprite name   Sprite number
-----      -----                  -----
SIGIOT     SIG_IOT                28
SIGEMT     SIG_EMT                29
SIGIO      SIG_IO_READY           26
SIGWINCH   SIG_WINDOW_CHANGE      27
SIGUSR1    SIG_USER1              30
SIGUSR2    SIG_USER2              31

If you think that assigning names to these signals is a mistake, or if you have an alternate suggestion for a name, please send me mail soon.
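As a rough illustration of the proposal, the names could be written as C definitions along the following lines. This is only a sketch: the macro names and numbers come from the message above, but the table type, the table itself, and the lookup helper SigCompat_Lookup are hypothetical and are not taken from the Sprite sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Proposed Sprite names, with the numbers listed in the message. */
#define SIG_IO_READY        26      /* UNIX SIGIO */
#define SIG_WINDOW_CHANGE   27      /* UNIX SIGWINCH */
#define SIG_IOT             28      /* UNIX SIGIOT */
#define SIG_EMT             29      /* UNIX SIGEMT */
#define SIG_USER1           30      /* UNIX SIGUSR1 */
#define SIG_USER2           31      /* UNIX SIGUSR2 */

/* Hypothetical table entry pairing a UNIX name with a Sprite name/number. */
typedef struct {
    const char *unixName;
    const char *spriteName;
    int         spriteNumber;
} SigCompatEntry;

/* Hypothetical table, for illustration only. */
const SigCompatEntry sigCompatTable[] = {
    { "SIGIOT",   "SIG_IOT",           SIG_IOT },
    { "SIGEMT",   "SIG_EMT",           SIG_EMT },
    { "SIGIO",    "SIG_IO_READY",      SIG_IO_READY },
    { "SIGWINCH", "SIG_WINDOW_CHANGE", SIG_WINDOW_CHANGE },
    { "SIGUSR1",  "SIG_USER1",         SIG_USER1 },
    { "SIGUSR2",  "SIG_USER2",         SIG_USER2 },
};

/*
 * Hypothetical helper: return the proposed Sprite number for a UNIX
 * signal name, or 0 if the name is not one of the six in the table.
 */
int
SigCompat_Lookup(const char *unixName)
{
    size_t i;

    assert(unixName != NULL);
    for (i = 0; i < sizeof(sigCompatTable) / sizeof(sigCompatTable[0]); i++) {
        if (strcmp(sigCompatTable[i].unixName, unixName) == 0) {
            return sigCompatTable[i].spriteNumber;
        }
    }
    return 0;
}
```

Keeping the numbers in one place like this would let kernel code refer to SIG_IO_READY rather than a bare 26, which is the stated goal of the proposal.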
mike Log-Number: 32268 Subject: stderr buffering inconsistent BSD, SunOS Date: Wed, 25 Mar 92 14:21:06 PST From: Mike Kupfer <kupfer> The Sprite stdio package buffers writes to stderr, which means that if you run the program below, you have to let it run for awhile before you see anything. If you run the program on okeeffe or ginger, you start getting output immediately. I guess this is primarily an issue for programs that write debugging messages to stderr and don't end each message with a newline. mike
--
#include <stdio.h>

main()
{
    while (1) {
        fprintf(stderr, ".");
        sleep(2);
    }
}

Log-Number: 32269 Subject: possible explanation for Error code 16 Date: Wed, 25 Mar 92 14:41:32 PST From: Mike Kupfer <kupfer> I've been trying to figure out what causes the random Error code 16's when running pmake. Here's what I think is happening. If a migrated process does a Proc_Wait(), that turns into an RPC to the home machine. On the home machine this is implemented by Proc_RpcRemoteWait. Proc_RpcRemoteWait verifies that the process doing the waiting is, in fact, migrated and returns PROC_NO_PEER if it isn't. Well, I recently got a message in my syslog saying that Proc_RpcRemoteWait had found a READY, not MIGRATED, process, and sure enough, there was a new "error code 16" in my pmake log. Examination of Proc_MigrateTrap shows a large chunk of code between (a) the point where the process is started up on the remote host and (b) the point where the process context switches to the MIGRATED state. The process is unlocked during parts of this code, and the MIGRATING flag is cleared long before the context switch. So, I think what's happening is a process gets evicted and then remigrated. When it resumes on the remote machine it quickly does a Proc_Wait. Proc_RpcRemoteWait is called on the home machine before the home process has context switched, and the result is you get this error.
I think the basic fix for this is to lock the process before context switching, using Proc_UnlockAndSwitch to do the context switch, and to delay clearing the MIGRATING flag until the final Proc_Lock before the context switch. The only question is what other changes should be made to Proc_MigrateTrap because of the new place for clearing the MIGRATING flag. For example, maybe the call to ProcMigWakeupWaiters should get moved. Will there be any problems if it's called while the migrating process is locked? How does clearing the MIGRATING flag interact with the MIG_ERROR flag? mike Log-Number: 32273 Date: Fri, 27 Mar 92 17:57:56 PST From: voelker (Geoffrey M. Voelker) Subject: `<file> not found' with ls The LFS filesystem on a local disk on arson became buggy when arson went into the debugger. Before it crashed, I had created a directory called `/t/t1/blah'. When I do an `ls', ls reports that it can't find `blah'. When I do a `mkdir /t/t1/blah', it creates another shadow `blah' directory and ls reports not finding multiple instances of `/t/t1/blah'. (Currently there are three shadow directories called `blah'). `rmdir' can't find the directory either. I believe I remember seeing a bug report to this effect earlier when there was the deluge of LFS problems. -geoff p.s. I'm sorry if my directory names are so, well, blah. (It had to be said.) Log-Number: 32274 Date: Sat, 28 Mar 92 20:23:39 PST From: mottsmth (Jim Mott-Smith) Subject: /usr/sww/X11R5/bin/xfig Under compatibility, xfig dies on sparcs with ld.so: text write-enable error (22) for main_$main_ and Page out of range in the syslog. The Decstation version seems to work. -- Jim M-S Log-Number: 32276 Subject: netroute: documentation, naming Date: Tue, 31 Mar 92 15:31:26 PST From: Mike Kupfer <kupfer> /boot/bootcmds still refers to both netroute and netroute.new. The man page for netroute says nothing about -v. mike Log-Number: 32279 Subject: LFS structures getting smashed? 
Date: Tue, 31 Mar 92 22:01:23 PST From: Mike Kupfer <kupfer> I've now seen the following problem on arson and oregano. Arson was running a private kernel of Geoff's, while oregano was running a private kernel of mine. The problem is that some process calls Lfs_StartWriteBack and hangs on the monitor lock. Unfortunately, it's holding the fscache monitor lock (fscacheBlocks.c), and eventually processes start piling up. The reason the process hangs on the lfs monitor lock seems to be that memory is getting trashed. Gdb reports values like (gdb) print $5.cacheBackendLock $6 = {inUse = -2129645736, waiting = 1, name = 0x3ffc8020 ERROR: invalid read address 0x3ffc8020 "", holderPC = 0x96488020 ERROR: invalid read address 0x96488020 "", holderPCBPtr = 0x96000208} When I first saw this on Arson I assumed it was gdb lying to me again. Now I'm beginning to think that it's telling the truth and that somebody is stomping on the LFS data structures. Has anyone else seen this? mike Log-Number: 32284 Subject: Re: LFS structures getting smashed? Date: Thu, 02 Apr 92 13:30:26 -0500 From: Fred Douglis <douglis@MITL.COM> I have been playing with some memory-intensive stuff on sprite and I found two interesting problems: one is that the machine would deadlock in the way you described, and the other is that at some point LFS would stop booting because it would get garbage in the summary block at startup. The second certainly suggests that LFS is getting trashed, and perhaps needs to do more sanity checks when it writes its checkpoints. The first, though, I thought was a real deadlock, and when I glanced through old sprite log messages on the subject it seemed this symptom had come up a lot before. (Pete had complained about it fairly recently, I think.) As for gdb, on the decstations here it gives me garbage so often that I've wound up resorting to printfs more often than not. 
Fred Log-Number: 32280 From: mgbaker (Mary Gray Baker) Subject: mail file corrupted Date: Wed, 01 Apr 92 11:09:03 PST /usr/spool/mail/mgbaker had garbage in it this morning. (Well, garbage of a more extreme variety than it usually contains.) It started out with something that went on for a long ways saying stuff like: 2768 0 0 0 16 0 325 4 0 0 Proc_GetPCBInfo 2 3 334 36 0 0 Sig_Send 4 0 0 14 0 0 4 0 0 16 0 349 4 0 0 Sig_Send 2 2 355 36 0 0 Sig_SetHoldMask 4 0 0 20 0 0 16 0 369 4 0 0 Sig_SetHoldMask 2 2 375 36 0 0 Sys_GetMachineInfo 4 0 0 18 0 0 16 0 390 4 0 0 Sys_GetMachineInfo 2 2 396 36 0 0 Sys_Shutdown 4 0 0 18 0 0 16 0 409 4 0 0 Sys_Shutdown 2 Log-Number: 32281 Date: Wed, 1 Apr 92 13:51:55 PST From: mottsmth (Jim Mott-Smith) Subject: Ghostview dies in compatibility mode Ghostview dies with a bus error consistently when run from sww. -- Jim M-S Log-Number: 32282 From: mgbaker (Mary Gray Baker) Subject: Unable to fetch handle for cleaning Date: Wed, 01 Apr 92 15:02:45 PST Allspice hung for a long time (many minutes) cleaning /swap1. It was printing repeatedly to its console: Can't fetch handle for file 54483 for cleaning It finally finished cleaning swap1, so maybe this isn't a problem. Sure took awhile, though. Mary Log-Number: 32283 Date: Wed, 1 Apr 92 16:53:00 PST From: elm (ethan miller) Subject: CRC/framing errors on sparc2 I'm running FrameMaker on my workstation (terrorism), and I seem to get an ethernet framing error & CRC error with every character I type. The FrameMaker program is actually running on joyride, but the display being used is terrorism's. Any idea what's causing these problems? As far as I can tell, there are no additional side effects (ie, the program seems to work fine). ethan
Log-Number: 32285 Date: Thu, 2 Apr 92 12:16:55 PST From: shirriff (Ken Shirriff) Subject: lpd out of control The lpd on hijack (ds5000) was out of control, printing zillions of <51>Apr 2 12:14:53 lpd[31f14]: accept: stale remote file handle I did a kill -DEBUG, but the process disappeared. Ken Log-Number: 32286 Subject: dangling symbolic links in /sprite/src Date: Thu, 02 Apr 92 18:15:41 PST From: Mike Kupfer <kupfer> I started a large "find" in /sprite/src to look for references to certain UNIX signals. I've been finding quite a few symbolic links that point off into nowhere.
In addition to links in admin/fsmake and admin/fsinstall (which I've already mentioned to John H.), there's benchmarks/pipe: printStats.c -> /sprite/src/cmds/bench/printStats.c boot/xyDiskBoot: fs.h -> ../generic/fs.h dev.c -> ../generic/dev.c devConfig.c -> ../generic/devConfig.c fsOpTable.c -> ../generic/fsOpTable.c fsOpTable.h -> ../generic/fsOpTable.h string.c -> ../generic/string.c boot/scsiTapeBoot: dev.c -> ../generic/dev.c devConfig.c -> ../generic/devConfig.c string.c -> ../generic/string.c fsOpTable.c -> ../generic/fsOpTable.c fsOpTable.h -> ../generic/fsOpTable.h boot/scsiDiskBoot.old: fsDisk.h -> /sprite/src/kernel/fs/fsDisk.h cmds/kgdb.sun3.new: initialized_all_files.c -> gdb/initialized_all_files.c kernel/fs.all: main.c -> ../fs/sizeof/main.c pfs.h -> ../fspdev/dev.old/pfs.h lib/include/g++/sys: fcntl.h -> /sprite/src/lib/g++/g++-include/sys/fcntl.h lib/fig2dev: fig2dev.h -> /X11/R4/src/cmds/fig2dev/dist/fig2dev.h genbox.c -> /X11/R4/src/cmds/fig2dev/dist/dev/genbox.c genepic.c -> /X11/R4/src/cmds/fig2dev/dist/dev/genepic.c genlatex.c -> /X11/R4/src/cmds/fig2dev/dist/dev/genlatex.c genpic.c -> /X11/R4/src/cmds/fig2dev/dist/dev/genpic.c genpictex.c -> /X11/R4/src/cmds/fig2dev/dist/dev/genpictex.c genps.c -> /X11/R4/src/cmds/fig2dev/dist/dev/genps.c genpstex.c -> /X11/R4/src/cmds/fig2dev/dist/dev/genpstex.c gentextyl.c -> /X11/R4/src/cmds/fig2dev/dist/dev/gentextyl.c gentpic.c -> /X11/R4/src/cmds/fig2dev/dist/dev/gentpic.c object.h -> /X11/R4/src/cmds/fig2dev/dist/object.h pi.h -> /X11/R4/src/cmds/fig2dev/dist/pi.h picfonts.h -> /X11/R4/src/cmds/fig2dev/dist/dev/picfonts.h psfonts.h -> /X11/R4/src/cmds/fig2dev/dist/dev/psfonts.h texfonts.c -> /X11/R4/src/cmds/fig2dev/dist/dev/texfonts.c texfonts.h -> /X11/R4/src/cmds/fig2dev/dist/dev/texfonts.h tpicfonts.h -> /X11/R4/src/cmds/fig2dev/dist/dev/tpicfonts.h Also, /sprite/src/lib/m/fmod.c is mode 600, and /sprite/src/lib/tk/dist/ks_names.h is mode 400. Shouldn't these files be at least group-readable? 
mike [30-Apr-92: we punted on the kgdb link (do this the next time the gdb sources are rationalized) and the g++ link (let the g++ users worry about it). Jim will look at the fig2dev stuff. -mdk ] Log-Number: 32290 Subject: allspice rebooted after hanging Date: Fri, 03 Apr 92 14:05:06 PST From: Mike Kupfer <kupfer> In an ironic twist, allspice got hung on a reopen RPC to king. Rob rebooted king, and allspice went through recovery with aix, but the reopen RPC remained hung. I took a core file (/home/ginger/cores/allspice.hang.king), which I will look at, and rebooted. Allspice was running the 1.111 kernel. mike Log-Number: 32291 Subject: more mysterious "disk full" messages Date: Fri, 03 Apr 92 15:00:19 PST From: Mike Kupfer <kupfer> Covet and anarchy (running 1.111 and the Sprite server, respectively) had problems within the past hour with messages like 4/3/92 15:38:25 allspice (14) RmtFile "/sprite/admin/loginFailures" <10,82848> Write-back failed: out of disk space<40008> despite the fact that "df" on allspice and on clients showed over 60MB free on /. The first time covet had this problem (a half hour ago, with /sprite/admin/dump/dumplog), I ended up rebooting it. 
The most recent incident ended on both covet and anarchy with 4/3/92 15:38:50 allspice (14) RmtFile "/sprite/admin/loginFailures" <10,82848> Write-back failed: stale handle The syslog for allspice shows ClientCommand, return-attrs msg to client 89 file "spritehosts" <10,90619> failed 3000a ClientCommand, return-attrs msg to client 89 file "spritehosts" <10,90619> failed 3000a Fri Apr 3 14:30:00 PST 1992 Fsconsist_Close, ".fscheck.out" <4,4>: client 88 not last writer 14, was cached ConsistTimeout (1 minutes) client 88 write-back & invalidate file <10,82848> "loginFailures" Client state killed: 0 refs 0 write 0 exec FsrmtFileVerify: "loginFailures" <10,82848> client 88 not found Fsrmt_RpcWrite, stale handle <10,82848> client 88 ConsistTimeout (1 minutes) client 89 write-back file <10,82848> "loginFailures" Client state killed: 0 refs 0 write 0 exec FsrmtFileVerify: "loginFailures" <10,82848> client 89 not found Fsrmt_RpcWrite, stale handle <10,82848> client 89 Client 88 is covet, client 89 is anarchy. Error 3000a is RPC_SERVICE_DISABLED. Similar "failed return-attrs" messages appear in allspice's syslog dealing with raid1 and lust. Note that the time (15:38) in the messages is an hour off. (Amazing that nobody has submitted a bug report on that yet this Spring. Either nobody's noticed, or we've finally got our users trained to ignore it. :-)) Is it possible that there is some sort of consistency problem where the client gets back an erroneous "disk full" status? mike Log-Number: 32293 Subject: more on mysterious "disk full" messages Date: Fri, 03 Apr 92 18:31:42 PST From: Mike Kupfer <kupfer> I started looking at the core file for the hang between allspice and king, and one of the first things I stumbled onto was fsconsistCache.c:EndConsistency(). It seems to assume that a consistency action will fail only because the disk is full (or the client is dead). Aren't there other failure modes besides these two? 
mike Log-Number: 32294 From: mgbaker (Mary Gray Baker) Subject: Re: more on mysterious "disk full" messages Date: Fri, 03 Apr 92 18:29:46 PST John was talking about this earlier today. Consistency can also fail if a server does consistency to another machine, but can't get that machine's reply because the server hasn't yet turned on its ability to accept RPCs from other machines. This can happen during boot. Mary Log-Number: 32295 Subject: Re: more on mysterious "disk full" messages Date: Fri, 03 Apr 92 18:46:10 PST From: Mike Kupfer <kupfer> Are write conflicts another possibility? The first time today that covet was getting bogus "disk full" messages it was (I think) trying to write back /sprite/admin/dump/dumplog, which I had mistakenly tried to edit from sage while covet was doing a "dump -t". Poking around in the old allspice syslog I see Fscache_Write: Alloc failed <10,10> "dumplog" DISK FULL Fsconsist_Close, "jhh" <10,2408>: client 83 not last writer 14, was cached ConsistTimeout (1 minutes) client 88 write-back & invalidate file <10,72821> "dumplog" Client state killed: 1 refs 1 write 0 exec FsrmtFileVerify: "dumplog" <10,72821> client 88 not found Fsrmt_RpcWrite, stale handle <10,72821> client 88 FsReopenHandle: file "dumplog" <10,72821>: client 88 has dirty blocks, but client 33 is using mike Log-Number: 32292 From: mgbaker (Mary Gray Baker) Subject: Recovery problems Date: Fri, 03 Apr 92 17:46:45 PST Machines are waiting for recovery with the server, and when the server reboots they are not getting their processes kicked alive again. I will put some logging in the recovery code to see what possible path through the code it could be taking that it escapes this vital operation. But I will probably not do it for another week, since I need to finish this paper first. 
Mary Log-Number: 32299 Subject: function prototypes & mips cc: I spoke too soon Date: Sun, 05 Apr 92 12:51:13 PDT From: Mike Kupfer <kupfer> I tried enabling function prototypes and compiling part of the Sprite server on a DECstation. It didn't work. Apparently the MIPS compiler gets confused if a function pointer (e.g., in a struct definition) uses prototypes and (a) the prototype has multiple arguments and (b) some argument other than the first uses a typedef. In particular, the compiler gags on the definition of writeProc in struct _file in stdio.h. So, my vote is to leave cfuncproto.h alone until MIPS fixes its compiler. mike Log-Number: 32306 Subject: Re: Cleaning messages Date: Wed, 08 Apr 92 12:50:36 -0700 From: mendel@lagunita.stanford.edu > > > Allspice's console has a lot of messages > of the form: > Can't fetch handle for file xxxxx for cleaning > where xxxxx is a 5 digit number. > > What does this mean? > > -- Jim M-S During cleaning, LFS reads in the segments being cleaned, identifies the live contents, and brings them into the file cache. In order to bring a block into the file cache, LFS needs to fetch and/or create a local file system file handle (Fsio_FileIOHandle) for the file. It is possible that the cleaning code finds the file handle locked when the fetch occurs. Since blocking the segment cleaner can lead to deadlock, the code has no choice but to skip over the file. The current code also prints the above message. The 5 digit number is the file number (i-number) of the file being skipped over. This message was more informative when there was only one LFS file system. Without knowing the file system it is kind of hard to tell what is going on. Note that if the segment cleaner skips over a file while cleaning, the segment cannot be marked as clean. Part of the reason for the message was to inform me that this had occurred and the system was doing work without generating clean segments. This message appears much more frequently than I anticipated. 
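The skip-on-locked-handle behaviour described above can be sketched roughly as follows. This is a minimal Python model, not Sprite code: `Handle`, `clean_segment`, and `segment_is_clean` are hypothetical names, and a `threading.Lock` stands in for the file handle lock.

```python
import threading


class Handle:
    """Hypothetical stand-in for a file handle with its own lock."""

    def __init__(self, file_number):
        self.file_number = file_number
        self.lock = threading.Lock()


def clean_segment(handles):
    """Try each handle without blocking; return (cleaned, skipped) file numbers."""
    cleaned, skipped = [], []
    for handle in handles:
        # A non-blocking acquire never waits, so the cleaner cannot deadlock here.
        if handle.lock.acquire(blocking=False):
            try:
                cleaned.append(handle.file_number)  # stand-in for copying live blocks
            finally:
                handle.lock.release()
        else:
            # Handle is busy: skip the file and remember that it was skipped.
            skipped.append(handle.file_number)
    return cleaned, skipped


def segment_is_clean(skipped):
    """A segment can be marked clean only if no file in it was skipped."""
    return not skipped
```

Under these assumptions, a single skipped file is enough to keep the segment from being marked clean, which matches the observation that the system can do cleaning work without generating clean segments.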
Mendel Log-Number: 32312 Date: Wed, 8 Apr 92 18:54:56 -0700 From: dlong@cats.UCSC.EDU (Dean R. E. Long) Subject: Re: lfs and swap1 > From kupfer@allspice.Berkeley.EDU Wed Apr 8 18:27:44 1992 > To: bugs@allspice.Berkeley.EDU > Subject: lfs and swap1 > Date: Wed, 08 Apr 92 18:27:44 PDT > From: Mike Kupfer <kupfer@allspice.Berkeley.EDU> > > Well, we just had another multi-minute pause while allspice cleaned > /swap1. Maybe we should start thinking about making /swap1 be an OFS > instead of an LFS. > > mike > Maybe changing numSegsToClean in the superblock from 100 to something smaller would make the pauses bearable. The lfschkpt program can be used to change that value. I find it comes in handy on my machine, since I have a small amount of memory. dl Log-Number: 32313 Subject: bogus name in ClientCommand printf Date: Wed, 08 Apr 92 23:17:56 PDT From: Mike Kupfer <kupfer> When I restart the Sprite server on anarchy, I frequently see messages in allspice's syslog like ClientCommand, return-attrs msg to client 89 file ",RCSt3480077" <10,90523> failed 3000a The name is bogus. It's really complaining about /etc/spritehosts. Perhaps the name it's using is the name of the RCS temporary file that was created the last time spritehosts was changed? The link count on spritehosts is 1, so that's no excuse. mike Log-Number: 32318 Subject: consistency problem: covet and allspice (really LFS problem?) 
Date: Fri, 10 Apr 92 18:08:39 PDT From: Mike Kupfer <kupfer> From covet's syslog: RpcDoCall: <write> RPC to allspice is hung Fri Apr 10 17:30:01 PDT 1992 <write> RPC ok 4/10/92 17:31:38 allspice (14) RmtFile "ds5000.md/lfsBlockIO.o" <2,106381> Write-back failed: stale handle 4/10/92 17:31:38 allspice (14) - recovering handles Fsprefix_OpenCheck waiting for recovery 4/10/92 17:31:51 allspice (14) Recovery complete 737 handles reopened 241 failed reopens From allspice's syslog: /user2: Cleaning started - deficit 95 segs ConsistTimeout (1 minutes) client 88 write-back file <2,106381> "lfsBlockIO.o" Client state killed: 0 refs 0 write 0 exec Fri Apr 10 17:30:01 PDT 1992 ConsistTimeout (1 minutes) client 75 write-back file <2,66858> "fscacheBlocks.o" Client state killed: 0 refs 0 write 0 exec /user2: Cleaned 160 segments in 40 segments /sprite/src/kernel: Cleaning started - deficit 60 segs /sprite/src/kernel: Cleaned 66 segments in 6 segments /sprite/src/kernel: Cleaning started - deficit 1 segs FsrmtFileVerify: "lfsBlockIO.o" <2,106381> client 88 not found Fsrmt_RpcWrite, stale handle <2,106381> client 88 FsrmtFileVerify: "fscacheBlocks.o" <2,66858> client 75 not found Fsrmt_RpcWrite, stale handle <2,66858> client 75 4/10/92 17:31:37 covet (88) initiating recovery Fscache_BlockRead: Giving zeros to "lfsBlockIO.o" <2,106381> block 17 amount 88031 Fscache_BlockRead: Giving zeros to "fscacheBlocks.o" <2,66858> block 6 amount 74208 (plus lots more lines about giving zeros to the two files) Allspice is running the cmtice kernel. Covet is running 1.111. The other ConsistTimeout was with tyranny, which I ended up rebooting because it was hanging migd requests. The reason I noticed this is that ld eventually gagged when it tried to make the corresponding module .o file. So was covet trying to write to /user2, which then hung all writes to allspice, causing the write-back to fail? 
mike Log-Number: 32320 Subject: munged RCS files Date: Sun, 12 Apr 92 15:54:41 PDT From: Mike Kupfer <kupfer> /sprite/src/boot/diskBoot.OpenProm/RCS/{fsDisk,devConfig}.c,v are broken. The former is a makefile fragment, and the latter is a formatted man page. mike Log-Number: 32328 Subject: ds5000 vm usage bug Date: Wed, 15 Apr 92 17:56:40 -0400 From: Fred Douglis <douglis@MITL.COM> It seems that the decstations report VM usage improperly: they lose track of memory allocated via Vm_BootAlloc. This is because vmMemEnd is incremented, by Vm_BootAlloc, then reset to the start of "dynamic memory", and then later on the value of &end is used instead of the previous value of vmMemEnd. I wanted to fix this, so I made a quick patch to save vmMemEnd in another variable, but this variable is kept in the machine-independent area so the fix isn't terribly clean. (But then again, neither are all the places in VM that have #ifdef's depending on machine type and which promise to be temporary :-). Plus, I've made other changes since the last time I checked in these sources, so it would be hard to generate a patch for you. However, I'll do it if anyone cares to fix this bug in your sources and it's not immediately apparent how to do so. Fred Log-Number: 32329 Subject: fsstat confusion Date: Wed, 15 Apr 92 19:02:35 PDT From: Mike Kupfer <kupfer> Most of the Sprite server's file activity right now is from paging. This leads to some outrageous percentage numbers from fsstat. While tracking down these numbers, I'm having problems understanding just what fsstat is trying to tell me. The first WRITES number (Fs_BlockCacheStats.writeAccesses): this counter is maintained by Fscache_Write and appears to be the number of blocks that were dirtied. The first WRITETHRU number (Fs_BlockCacheStats.dataBlocksWrittenThru): this counter is maintained by Fscache_GetDirtyBlock and appears to be the number of blocks that were cleaned. Question #1. 
Why is the WRITETHRU number greater than the WRITES number on allspice, lust, and oregano? This doesn't hold for any of the clients I checked (oregano has /scratch1 on it, so it's not a simple client). When LFS calls Fscache_GetDirtyBlock, does that block get cleaned then, or can there be multiple Fscache_GetDirtyBlock calls for the same dirty block (e.g., descriptor)? Alternatively, is there some path in LFS that goes through Fscache_GetDirtyBlock but not Fscache_Write? The WRITETHRU vm number (vmBlocksWritten): this counter is maintained by VmPageServerWrite and appears to be the number of VM blocks that were cleaned (i.e., actually written out to disk or to a server). Question #2. What is the percentage figure that appears after the vm number? It is computed as vmBlocksWritten --------------- x 100 writeAccesses However, the VM traffic is not included in the (writeAccesses) cache statistics, so this computation doesn't yield a valid percentage. I suppose you could argue that it just represents a ratio of VM traffic to file traffic, but (a) wouldn't it make more sense to compare VM traffic with actual file system traffic (blocksWrittenThru) instead? And (b) I think it's dangerous to express a non-percentage on the same line with a bunch of numbers that (I think) *are* valid percentages. Byte traffic statistics: the percentage figures in the "Mb Read" and "Mb Write" are computed as foo ------------- x 100 cache traffic where "foo" is remote traffic, disk file traffic, or disk descriptor traffic. Question #3. Are these numbers supposed to be real percentages? Not all remote traffic goes through the cache (for example, covet reports that remote writes account for 107%; I guess it's been paging a lot). Does the "raw disk" (descriptor) traffic go through the cache? (Allspice has reported that the raw disk writes account for 105%, but maybe that's from the same bug in Question #1.) 
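To illustrate the concern in Question #2, here is a small sketch of the computation as described above. The counter values are made up for illustration, and `percent` is a hypothetical helper, not part of fsstat. When the numerator counts traffic that is not included in the denominator, the result can exceed 100 and is a ratio rather than a true percentage.

```python
def percent(numerator, denominator):
    """Return numerator/denominator scaled by 100, or None if undefined."""
    if denominator == 0:
        return None
    return 100.0 * numerator / denominator


# Made-up counters for illustration only.
write_accesses = 200      # blocks dirtied through the file cache
vm_blocks_written = 500   # VM pages written out; NOT counted in write_accesses

# Because VM traffic bypasses the cache counters, this "percentage" is 250.
vm_ratio = percent(vm_blocks_written, write_accesses)
```

A figure like 107% or 105% in the byte traffic columns would arise the same way, if some of the traffic in the numerator never passes through the cache.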
mike Log-Number: 32330 Subject: blocksPitched comment in fsStat.h Date: Wed, 15 Apr 92 20:28:12 PDT From: Mike Kupfer <kupfer> unsigned int blocksPitched; /* The number of blocks that were * thrown out at the command of * virtual memory. */ When I saw this I thought it had to do with the VM module asking the FS module for memory. What it really means is that the FS or VM module figured out that a block in the FS cache was duplicating a VM page, so the block got stuck at the front of the cache's LRU list, so that it would get reclaimed quickly. Does unsigned int blocksPitched; /* The number of blocks thrown * out because they duplicated * VM-managed blocks. */ sound okay to everyone? The man page for fsstats needs fixing, too. mike Log-Number: 32332 From: mgbaker (Mary Gray Baker) Subject: Recent allspice crashes Date: Thu, 16 Apr 92 17:55:31 PDT The recent allspice crashes appear to be due to a window underflow trap after all the register windows have been saved to the stack. To save all the windows to the stack, it calls FlushTheWindows(n) recursively NUM_WINDOWS - 1 times. On unrolling from this, it is getting a window underflow. I don't know any more than this yet, because most of the stack information gets blown away, but I'll investigate next time it happens. Alternatively, since this is in the experimental kernel being run by the 252 students, maybe we don't need to worry about it. Mary Log-Number: 32335 Subject: Re: More allspice crashes Date: Fri, 17 Apr 92 10:56:22 -0700 From: mendel@lagunita.stanford.edu > > Since Bob's message there have been two more Allspice crashes with the > same error (read from clean segment). I took a core dump of one of them > in /home/ginger/cores/allspice.4.17. Can someone take a look at this > ASAP to see what disk is failing? I think we need to take action to > clean up the offending disk or else Allspice is going to keep crashing. 
> > Also, I think we need to get out a new kernel that prints out the > name of the problem partition when errors like this occur, so we can > know immediately what disk is having problems. Yes. > > By the way, I rebooted the 1.112 kernel instead of cmtice, so that > we'd be able to debug from ginger. The core dump was made from the > 1.112 kernel. > -John- Executive summary: The problem has been fixed. The problem was that a swap file from sage (/swap1/33/79) had a block in a segment (984) on /swap1 marked as clean. Sage was trying to BlockCopy this block, which caused the read-from-clean-segment trap. Every time allspice rebooted, sage would retry the request that caused the error. The problem somewhat fixed itself because sage appears to have died. I truncated the file with a (cp /dev/null /swap1/33/79) command to fix the on-disk structure. Note that this command has the effect of putting allspice into the debugger when it detects the delete from a clean segment. I was careful to have a window open to allspice with a "kmsg -c allspice" ready to go when I typed the command. The file /swap1/63/130, a swap file from sabotage, had a similar problem. A single block was in a clean segment (498). I fixed the on-disk format in the same way as before. Looks like there is a problem with segment numbers containing 4, 9, and 8. It also may be related to the use of emacs on these machines (sage and sabotage). I can think of two possible causes of this problem. The first is that a segment is being marked as clean without the cleaning being run on it. This seems highly unlikely to me. The second is that the cleaning is not cleaning as well as it should. In other words, it is skipping over still alive blocks in a segment. I looked at the segments 498 and 984. Nothing looked unusual about the segments. File /swap1/33/79 had length 8192 and segment 984 contained only block 1 (4096-8192) of the file. 
Segment 498 was written with two blocks of /swap1/63/130, the block in error (34) and the first indirect block of the file. Something happened so that these blocks either weren't detected by the segment cleaner or were not written out as part of segment cleaning. This is a nasty bug. It can account for most if not all of the errors seen. Mendel Log-Number: 32338 Subject: recovery deadlock, migd hang Date: Sun, 19 Apr 92 16:27:03 PDT From: Mike Kupfer <kupfer> When I came in this afternoon, rup claimed that a bunch of machines were down. This was a lie--the problem was that their migds were all hung trying to talk to the migd on sedition. The reason sedition was hung was that it had deadlocked itself trying to do recovery after allspice rebooted. There were a bunch of processes with a call stack that ended with #0 0xf600c5c0 in Mach_ContextSwitch () #1 0xf60b9b6c in SyncEventWaitInt (...) (...) #2 0xf60b8d5c in Sync_SlowWait (...) (...) #3 0xf60a4840 in Proc_Lock ( procPtr=(struct Proc_ControlBlock *) 0xf64726c8) (procTable.c line 408) #4 0xf609cd14 in Proc_WakeupAllProcesses () (procMisc.c line 988) #5 0xf606112c in Fsutil_Reopen (...) (...) They were all trying to lock pid 14443, which was a new RPC server that was still being created. ID wtd user kernel event state name 14443 0 [0, 0] [0, 0] ffffffff new Rpc_Server Its parent was hung waiting for swap to come back. That is, it was doing a Sync_Wait on swapDownCondition. #0 0xf600c5c0 in Mach_ContextSwitch () #1 0xf60b9b6c in SyncEventWaitInt (...) (...) #2 0xf60b8d5c in Sync_SlowWait (...) (...) #3 0xf60c7614 in DoPageAllocate ( virtAddrPtr=(struct Vm_VirtAddr *) 0xf8011c08, flags=1) (vmPage.c line 1006) #4 0xf60c7708 in VmPageAllocate ( virtAddrPtr=(struct Vm_VirtAddr *) 0xf8011c08, flags=1) (vmPage.c line 1048) #5 0xf60cdab4 in Vm_GetKernelStack (...) (...) #6 0xf600d9e4 in Mach_SetupNewState (...) (...) #7 0xf609753c in Proc_NewProc (...) (...) #8 0xf60ac018 in Rpc_CreateServer (...) (...) 
#9 0xf60abf10 in Rpc_Daemon (...) (...) #10 0xf60b5538 in Sched_StartKernProc (...) (...) Unfortunately, the broadcast on swapDownCondition is done (indirectly) by Fsutil_Reopen, after Proc_WakeupAllProcesses completes. So it seems like either (a) Proc_WakeupAllProcesses needs to be smarter about locked processes or (b) Fsutil_Reopen should call Vm_Recovery before it calls Proc_WakeupAllProcesses. mike Log-Number: 32374 Subject: Re: recovery deadlock, migd hang Date: Wed, 29 Apr 92 11:54:26 PDT From: Mike Kupfer <kupfer> > So it seems like either > > (a) Proc_WakeupAllProcesses needs to be smarter about locked > processes > > or > > (b) Fsutil_Reopen should call Vm_Recovery before it calls > Proc_WakeupAllProcesses. I forgot a third possibility, which is for Proc_NewProc to unlock the child's PCB earlier than it does now. [This is a followup to a bug report that will be in tomorrow's list, so don't worry if it doesn't make sense out of context.] mike Log-Number: 32339 Subject: makedepend and cross-compiling Date: Sun, 19 Apr 92 22:55:18 PDT From: Mike Kupfer <kupfer> Unfortunately, running makedepend is currently not a machine-independent activity. This is because of the links in /sprite/src/kernel/Include that point to $MACHINE.md/mumble. So user programs that include <kernel/foo.h> get the foo.h for the machine that makedepend is running on, not the target machine's foo.h. This causes problems if foo.h doesn't exist for all machine types. Example: devAddrs.h. These links were all put in in mid-November by Bob. Maybe they were put in for use with imake? mike Log-Number: 32342 Subject: /etc/exports, unfsd Date: Mon, 20 Apr 92 20:22:54 PDT From: Mike Kupfer <kupfer> Is anyone still mounting Sprite filesystems via unfsd? I ask for two reasons. One is that /etc/exports lists a slew of filesystems that no longer exist. Two is that /etc/exports exports the filesystems to the entire world. 
This is apparently one of the backdoors that the intruder has been using to crack into SunOS systems. I don't know if the same tricks can be used to break into Sprite, but it would probably be a good idea to close the door anyway. (Unfortunately, it's hard to tell what security mechanisms unfsd uses. There's little documentation, the code is obscure, and the test case I tried from a Postgres machine failed for reasons apparently unrelated to security checks.) So, option 1 is to turn off unfsd. Option 2 is to clean up /etc/exports and leave unfsd running. Votes, anyone? mike [30-Apr-92: Jim will fix up /etc/exports. -mdk] Log-Number: 32344 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 21 Apr 1992 14:39:18 PDT Subject: bash/compatibility problems When I try to run /usr/sww/bin/bash on a ds5000 I get the following messages on my console. The kernel I'm running is equivalent to ds5000.1.112. Sig_SigvecStub: bad signal 5 Sig_SigvecStub: bad signal 5 Sig_SigvecStub: bad signal 11 Sig_SigvecStub: bad signal 11 Sig_SigvecStub: bad signal 5 John Log-Number: 32346 Date: Wed, 22 Apr 92 14:56:36 PDT From: shirriff (Ken Shirriff) Subject: Allspice crash Allspice wedged up last night. It wouldn't respond to L1-anything, but it seemed to respond to pings for some reason. Since there wasn't anything that could be done with it, I rebooted it. I used the new kernel since I didn't see a note specifying a different one. Jim says he was using Jaquith heavily at the time of the crash, so that might be a cause. Log-Number: 32348 Subject: Re: Allspice watchdog reset Date: Thu, 23 Apr 92 11:06:09 -0700 From: mendel@lagunita.stanford.edu > > > Allspice panic'd and got a watchdog reset running the cmtice kernel. > There were no informative error messages on the console > so I rebooted with the cmtice kernel. > > -- Jim M-S One of the easiest ways to get a watchdog reset is to run off the end of a kernel stack. You might want to check the cmtice kernel for large objects (i.e. 
buffers) being allocated on the stack. Since disk tracing mods are made in routines near the bottom of the call tree, it is possible that they are trying to push one byte too many on to the kernel stack. Mendel Log-Number: 32351 Date: Thu, 23 Apr 92 16:04:17 PDT From: elm (ethan miller) Subject: ethernet packet problem? Whenever I run FrameMaker on joyride (xhosted to terrorism), I get a CRC error and/or framing error with each packet the FrameMaker application sends to terrorism's X server. This slows things down a great deal. Everything still works, but much more slowly. Is there any reason that the "LE ethernet: Received packet with CRC error" messages are printed out? It'd probably make things run much faster if these weren't sent to the syslog. Alternatively, does anyone want to find out why the sun4c kernel believes that there's a CRC error and/or framing error in most FrameMaker packets? This bug is definitely repeatable on my machine. It's happened every time I've used Frame. ethan Log-Number: 32352 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Thu, 23 Apr 1992 16:15:30 PDT Subject: Re: ethernet packet problem? As far as I've been able to tell the CRC errors and framing errors actually exist and are not figments of Sprite's imagination. SunOs et al. do not print out these messages, therefore our complaints to the powers-that-be go unheeded because "normal" machines don't show any symptoms. I think we should probably tone down the error messages, so they are printed only infrequently. 
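One way to print such messages only infrequently is to report the first occurrence and then every Nth one, with a running count. The sketch below is a Python illustration of that idea only; `SampledWarning` and its interval are hypothetical and do not reflect how the Sprite network driver is written.

```python
class SampledWarning:
    """Emit a warning on the first occurrence and then every `interval`-th one."""

    def __init__(self, message, interval):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.message = message
        self.interval = interval
        self.count = 0

    def note(self):
        """Record one occurrence; return the line to log, or None to stay quiet."""
        self.count += 1
        if self.count == 1 or self.count % self.interval == 0:
            return "%s (seen %d times)" % (self.message, self.count)
        return None
```

With an interval of 100, for example, 250 bad packets would produce three log lines instead of 250, while the count still shows that the errors are real and frequent.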
John Log-Number: 32359 Subject: permissions problems, cruft in /sprite/src/kernel Date: Fri, 24 Apr 92 10:45:01 PDT From: Mike Kupfer <kupfer> sage-1# cd /sprite/src/kernel sage-2# ls -l dev/ds5000.md/devSCSIC90.c lfs/lfsStats.h \ main/ds3100.md/mainInit.c main/main.h main/ds5000.md/mainInit.c \ main/sun3.md/mainInit.c main/sun4.md/mainInit.c main/sun4.md/stub.c -rw-rw-r-- 1 root 41097 Mar 12 22:50 dev/ds5000.md/devSCSIC90.c -rw-rw-r-- 1 mgbaker 11098 Apr 1 15:35 lfs/lfsStats.h -rw-rw-r-- 1 jhh 12460 Nov 4 12:55 main/ds3100.md/mainInit.c -rw-rw-r-- 1 jhh 12635 Nov 4 12:55 main/ds5000.md/mainInit.c -rw-rw-r-- 1 jhh 1254 Nov 4 12:55 main/main.h -rw-rw-r-- 1 jhh 11546 Nov 4 12:55 main/sun3.md/mainInit.c -rw-rw-r-- 1 jhh 11567 Nov 4 12:55 main/sun4.md/mainInit.c -rw-rw-r-- 1 jhh 0 Nov 4 12:55 main/sun4.md/stub.c I assume I should delete stub.c and chmod the remaining files to 444. Any objections? mike Log-Number: 32360 Subject: lust reboot due to dementia Date: Fri, 24 Apr 92 11:09:05 PDT From: Mike Kupfer <kupfer> The nfsmount's on lust seemed to be screwed up this morning, and I couldn't rlogin in. When I checked at the console, there was a funny-looking root prompt, and the hostname command didn't work. I tried logging out so that I could log in as myself, but I never got a new login prompt. At that point I reset lust and rebooted it. mike Log-Number: 32361 Subject: yet another allspice hang Date: Fri, 24 Apr 92 11:12:20 PDT From: Mike Kupfer <kupfer> Allspice hung for no apparent reason. I rebooted with the new (1.112) kernel, under the assumption that the recent rash of hangs is related to the cmtice kernel. By the way, I tried putting allspice into the debugger from ginger, but I got told that -d wasn't a valid option for kmsg. mike [30-Apr-92: Mary will see if Dean Long has a kmsg that supports -d and runs on SunOS. -mdk] Log-Number: 32363 Date: Fri, 24 Apr 92 12:18:59 PDT From: voelker (Geoffrey M. 
Voelker) Subject: arson arson seems to be a little handicapped lately. I found it in the monitor, and someone had tried to reboot it but was getting ECC errors. I did an `init' and tried to reboot it myself, but got more ECC errors. Could this be just because of heat? 608-4 is warm, but not exceptionally so. -geoff [30-Apr-92: the workaround for this is to turn the machine off and wait a few minutes before turning it on again. -mdk] Log-Number: 32366 Date: Sat, 25 Apr 92 14:22:46 PDT From: shirriff (Ken Shirriff) Subject: Allspice wedged up on cleaning Allspice seemed to go into an infinite clean cycle on /swap1. it cleaned /swap1 for about 10 minutes, printed: FscacheGetDirtyFile skipping deleted file <0,27973> "16" for 5 minutes and then went back to cleaning /swap1. Since it wasn't making any progress I rebooted. Log-Number: 32369 Date: Sun, 26 Apr 92 12:12:15 PDT From: voelker (Geoffrey M. Voelker) Subject: lust Lust seemed to be wedged around noon today. There was a series of about 13 messages of the form: *** compat: unknown errno value 262144 Before I went in to check the machines I had noticed that piracy had a series of messages about someone failing to open `netroute.new'... possibly the two are related. I don't know why lust would have been wedged. I rebooted lust with the cmtice kernel. -geoff Log-Number: 32370 Subject: Re: lust Date: Sun, 26 Apr 92 12:52:05 PDT From: Mike Kupfer <kupfer> > Lust seemed to be wedged around noon today. There was a series of about > 13 messages of the form: > > *** compat: unknown errno value 262144 This error message comes from the routine that maps a UNIX errno value to a Sprite ReturnStatus (Compat_MapToSprite). It looks like somebody is passing in a value that is already a ReturnStatus (FS_NO_ACCESS). 
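The failure mode described above can be illustrated with a short sketch. The value 262144 is 0x40000, which the message identifies as FS_NO_ACCESS. Everything else here is an assumption made for illustration: the mapping table, the `GEN_FAILURE` value, and the guess that a ReturnStatus always carries nonzero bits above the low 16 (suggested by codes such as 3000a and 40008 elsewhere in this log). It is not the actual Compat_MapToSprite logic.

```python
import errno

FS_NO_ACCESS = 0x40000   # 262144, the value seen in the "unknown errno" message
GEN_FAILURE = 0x00001    # hypothetical catch-all status for unmapped errnos

# Hypothetical errno -> status table; the real one lives in the compat library.
ERRNO_TO_STATUS = {0: 0, errno.EACCES: FS_NO_ACCESS}


def looks_like_status(value):
    """Assumption: an already-mapped status has bits set above the low 16."""
    return value >= 0x10000


def map_to_status(value):
    """Map a UNIX errno to a status, tolerating values that are already mapped."""
    if looks_like_status(value):
        # The caller passed a status by mistake; hand it back unchanged.
        return value
    return ERRNO_TO_STATUS.get(value, GEN_FAILURE)
```

A check like this would only mask the symptom. The underlying bug is still whichever caller passes a status where an errno is expected.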
mike Log-Number: 32372 Subject: mysterious migd deaths Date: Sun, 26 Apr 92 22:22:49 PDT From: Mike Kupfer <kupfer> There seems to be an annoying problem lately where migd's get hung or die on a client and have to be manually restarted. When I came in this afternoon (around 1645), for example, I had to restart migd on 4 or 6 machines. The migd logs for the machines in question frequently ended with messages like Error 5 writing to global daemon: I/O error. Error 1 writing to global daemon: not owner. 2121e: terminated by order of global daemon... should be restarted soon. terminated by order of global daemon. or Error 5 writing to global daemon: I/O error. Error 5 writing to global daemon: I/O error. ContactGlobal: couldn't open /sprite/admin/migd/pdev: I/O error Error 5 writing to global daemon: I/O error. This host is being reclaimed by order of global migration daemon. This host is being reclaimed by order of global migration daemon. This host is being reclaimed by order of global migration daemon. This host is being reclaimed by order of global migration daemon. or Error 5 writing to global daemon: I/O error. Error 22 writing to global daemon: invalid argument. ContactGlobal: couldn't open /sprite/admin/migd/pdev: I/O error Error 5 writing to global daemon: I/O error. Error 5 writing to global daemon: I/O error. ContactGlobal: couldn't open /sprite/admin/migd/pdev: invalid argument ContactGlobal: couldn't open /sprite/admin/migd/pdev: invalid argument ContactGlobal: couldn't open /sprite/admin/migd/pdev: invalid argument ContactGlobal: couldn't open /sprite/admin/migd/pdev: invalid argument Migd_Init - Unable to contact master of global pdev: operation would block Exiting. mike Log-Number: 32373 Subject: allspice ran out of memory Date: Mon, 27 Apr 92 15:18:21 PDT From: Mike Kupfer <kupfer> /jaquith: Cleaning started - deficit 217 segs Fatal Error: VmMach_DMAAlloc: unable to satisfy request for 131072 bytes at 0xf66c3948 We rebooted the 1.112 (new) kernel. 
mike Log-Number: 32375 Subject: serious cleaning overload on allspice Date: Wed, 29 Apr 92 15:23:32 PDT From: Mike Kupfer <kupfer> Allspice got into a mode where it was continually cleaning /swap1. Excerpts from allspice's syslog are below. This appears to have been caused by a runaway IP server on arson that had ballooned up and was paging like crazy. mike [syslog excerpt deleted -mdk] Log-Number: 32376 Date: Wed, 29 Apr 92 16:04:51 PDT From: pmchen (Peter M. Chen) Subject: Re: serious cleaning overload on allspice Dunno if this is relevant to the /swap1 problems, but some random user "digres" on clove was thrashing clove. Here's the ps from clove: digres 4391c 1.7 28.6 53640 9364 READY 0:20 kimtables digres a3949 1.7 30.6110508 10032 READY 0:37 kimtables This was about 3:30pm. Pete Log-Number: 32378 Date: Sat, 2 May 92 18:44:40 PDT From: dlong (Dean R. E. Long) Subject: negative uid's "chown nobody filename" and "su nobody" don't seem to work. Some things think a uid is a short, while others think it's an int. Probably others think it's an unsigned short. dl Log-Number: 32395 Date: Fri, 8 May 92 12:54:44 PDT From: mottsmth (Jim Mott-Smith) Subject: pcs messed up /pcs, which had a hard error a few days back, seems to be messed up. Trying to delete ~decman/access/shar.file hangs your xterm. -- Jim M-S Log-Number: 32414 Subject: hung RPCs due to /pcs Date: Thu, 14 May 92 15:23:49 PDT From: Mike Kupfer <kupfer> I'm beginning to suspect that many of the recent problems can be attributed to /pcs. I put sage into the debugger, and the reason it's got a hung RPC is that it's trying to do a write to /pcs. Rebooting lust doesn't completely fix the RPCs, because the clients will try to write back the same dirty /pcs blocks when lust comes back. I think the reason that random machines are being affected is that none of the people using /pcs actually runs Sprite these days, so they rlogin onto Sprite machines to do their dirty work.
Also, a bunch of people's home directories are in /pcs, so if they get mail, the sendmails generate RPCs that hang. I suspect that over time things get more and more gummed up until something fails completely. I vote to let things go for right now, but to disable the prefix command for /pcs, so that when we eventually have to reboot lust, the problems will (I hope) go away. mike Log-Number: 32382 Date: Sun, 3 May 92 17:10:19 PDT From: shirriff (Ken Shirriff) Subject: tar failed during dumps: long name The weekly dumps hung due to the following: dumping /local skipping 1 files position = 2317401 execing tar ncbfTPL 128 - - successfully forked tar Assertion failed: (dp->d_namlen <= 255) line 69 of "readdir.c" Ken Log-Number: 32384 Date: Mon, 4 May 92 07:45:39 PDT From: bmiller (Bob Miller) Subject: Lust hung Lust was in the debugger this morning... TLB LS miss exception at PC 0x800a85a4 I rebooted. Bob Log-Number: 32385 Date: Mon, 4 May 92 07:58:52 PDT From: bmiller (Bob Miller) Subject: oops There was a typo in my message about Lust. Should have been TLB LD (not LS) miss exception... Bob Log-Number: 32388 From: mgbaker (Mary Gray Baker) Subject: Allspice and sendmail Date: Wed, 06 May 92 15:53:00 PDT John and I have been restarting all the servers on allspice repeatedly today. A lot of mail has not been getting through. Does anybody know what the problem is or why it got particularly bad today? If allspice appears to be down every time remote mailers contact it, mail may not get through for quite a while. Mary [28-May-92: the hypothesis is that allspice was suffering from the ftp load induced by a new Tk release. -mdk] Log-Number: 32389 Subject: lust crash: packet too large Date: Wed, 06 May 92 16:38:44 PDT From: Mike Kupfer <kupfer> Lust died with Fatal Error: OutputPacket: packet too large (4174) This message was preceded by a couple "Too many collisions" messages. 
mike Log-Number: 32391 Date: Thu, 7 May 92 08:05:48 PDT From: bmiller (Bob Miller) Subject: allspice Allspice was down when I came in this morning. It appeared to be a cleaning error on /swap1 (I just caught a brief look at the screen before the 'entering debugger' message scrolled it off). ...Interrupt Trap (16) exception at PC 0xf60d40ac I tried to take a core dump from dill, but kgcore gave me timed out and resend messages. I rebooted allspice, but it died again with: Fatal Error: LfsOkToRead read from clean segment I rebooted a second time. Bob Log-Number: 32393 Date: Thu, 7 May 92 09:44:50 PDT From: ouster (John Ousterhout) Subject: Migd problems Oops, sorry for the incomplete preceding message. When I came in this morning, pmake was having problems with migd: MigOpenPdev: Error opening pdev /sprite/admin/migd/pdev (still trying): I/O error. MigOpenPdev: Unable to contact daemon. I thought that if I deleted the pdev then a new migration daemon would automatically start up, but it just caused a different error message: MigOpenPdev: Error opening pdev /sprite/admin/migd/pdev (still trying): no such file or directory. MigOpenPdev: Unable to contact daemon. Does anyone know what's going on here? I thought that new migration daemons were supposed to get created automatically when old ones die or become unreachable, but this doesn't seem to be happening. The migration situation is still goofed up. -John- Log-Number: 32394 Subject: Re: Migd problems Date: Thu, 07 May 92 10:18:50 PDT From: Mike Kupfer <kupfer> What seems to have happened is that the global master was running on clove and got stuck. Apparently after John removed the pdev the other migd's kept trying to talk to the master on clove; I don't know why. I didn't see anything in clove's migd log or in the global log to indicate why things got stuck in the first place. "rpcstat -chan" on arson showed a channel to clove that was "busy input"--maybe the timeout processing for the RPC got fouled up?
Anyway, I killed and restarted the migd on clove, and this seemed to free up everyone except allspice, so I killed and restarted its migd as well. mike Log-Number: 32397 Date: Fri, 8 May 92 21:06:25 PDT From: shirriff (Ken Shirriff) Subject: Why / ran out of space The ip server on sabotage created a 95 MB error file in /hosts/sabotage/ip.out. I kmsg -d'd sabotage and got rid of the file. Ken Log-Number: 32398 Date: Fri, 8 May 92 23:39:59 PDT From: voelker (Geoffrey M. Voelker) Subject: Lust went south When I came back from dinner at around 11:15, lust seemed to be going in circles. Allspice was recovering handles and failing repeatedly from lust's point of view on its console, and the rest of the world showed lust trying to recover handles and failing. I could ping lust, but I could not rlogin into it or get a prompt at its console. So I rebooted it and things seemed to have fixed themselves. Lust's console was filled with allspice and loiter trying to recover handles and failing, and with RPC broadcast timeouts. Covet's syslog showed a long series of `lust RPC timeouts' and `failed recovery' with a recovery done of 30002 which, looking in /usr/include/status.h, looks like an RPC_TIMEOUT. -geoff Log-Number: 32403 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 11 May 1992 22:03:36 PDT Subject: Re: "expect" script to monitor ipServer? I have the source to expect and now and then I work on porting it to Sprite. We don't have ptys. I have a hacked-up pty implementation that doesn't seem to work yet. I'll keep working on it during my copious spare time but don't expect anything soon (no pun intended). John Log-Number: 32405 Subject: Re: more mysterious pmake hangs Date: Tue, 12 May 92 14:22:44 PDT From: Mike Kupfer <kupfer> I found the culprit for the hangs: clove was failing the exec's of sh, and apparently this information wasn't propagating back to pmake.
There was a repeating pattern of lines in clove's syslog that looked like Fsprefix_OpenCheck waiting for recovery Fsprefix_OpenCheck ok open of "/sprite/cmds/sh" waiting for recovery Remote exec of /sprite/cmds/sh failed: the system call was aborted by a signal According to the code (ProcDoRemoteExec), the remote process exits at this point. I don't know why the exit information doesn't make it back to the parent. mike Log-Number: 32406 Date: Wed, 13 May 1992 02:49:35 -0700 From: "Dean R. E. Long" <dlong@cse.ucsc.edu> Subject: List_Remove panic (item's pointers are invalid) We've been getting this particular panic quite a bit lately: in DeleteBlock() of fscacheBlocks.c, List_Remove(&blockPtr->fileLinks); dl Log-Number: 32407 Date: Wed, 13 May 1992 02:51:46 -0700 From: "Dean R. E. Long" <dlong@cse.ucsc.edu> Subject: List_Remove (cont) I almost forgot. We are running the sun4c.1.112 kernel. The last time the panic happened, I was untarring a file onto an LFS filesystem. dl Log-Number: 32408 Date: Wed, 13 May 92 08:12:01 PDT From: bmiller (Bob Miller) Subject: Lust Lust was hung this morning (allspice seemed to be OK). I rebooted it and it couldn't find the server for "/". So, I rebooted allspice. Allspice's console had these messages: 'Reinit recv unit' and 'Intel: Spurious interrupt (2)' Bob Log-Number: 32422 Subject: Re: scrolling syslog window, allspice reboot Date: Fri, 15 May 92 14:56:45 PDT From: Mike Kupfer <kupfer> As usual, "df /" showed plenty of space. I looked through subversion's syslog, but none of the new printf's that I'd put in were there. I tried "tail /sprite/admin/migd/global-log". The last part of the file looked like part of a grant proposal. Unfortunately, all future commands that I typed at subversion hung, so I told Bob to reboot. 
I think at about this time the following messages appeared in allspice's syslog: ConsistTimeout (1 minutes) client 90 write-back & invalidate file <10,9620> "global-log" Client state killed: 1 refs 1 write 0 exec FsrmtFileVerify: "global-log" <10,9620> client 90 not found Fsrmt_RpcWrite, stale handle <10,9620> client 90 5/15/92 14:19:05 subversion (90) Dropping regular open during recovery <write> 5/15/92 14:21:50 subversion (90) RPC timed-out <30>May 15 14:21:54 migd[50e37]: Write to global daemon timed out. <close> 5/15/92 14:22:00 subversion (90) RPC timed-out 5/15/92 14:22:22 subversion (90) rebooted ClientCommand, return-attrs msg to client 90 file "spritehosts" <10,90598> failed 3000a <prefix> 5/15/92 14:22:36 broadcast (0) RPC timed-out ProcessConsist: write-back & invalidate request failed <40008> file "global-log" <10,9620> Consistency failed 40008 on <10,9620> I tried looking at global-log on both allspice and sage; in both cases the requests hung. Sage had hung RPCs to treason and allspice; allspice had hung RPCs to treason. I took a core dump of allspice. It's /home/ginger/cores/allspice.hang.migdlog. The kernel was 1.112. I rebooted the new kernel off of allspice's disk, which got me 1.112 again. Is the installation of new kernels in /allspiceA and /lustA done by hand or mechanically? mike Log-Number: 32411 Subject: litany of server hangs, full partitions, sendmail problems Date: Wed, 13 May 92 12:27:08 PDT From: Mike Kupfer <kupfer> Sprite was in a sorry state this morning. I don't know which of the following problems are related and which are independent. (1) covet was complaining that it couldn't write back its 45KB migd log file, despite the fact that there was 64MB free on the root partition.
My guess is that the root partition did fill up around 0800 this morning; there is a message in the old Allspice syslog Fscache_Write: Alloc failed <10,10> "maillog" DISK FULL followed eventually by ConsistTimeout (1 minutes) client 90 write-back & invalidate file <10,82837> "maillog" Client state killed: 1 refs 1 write 0 exec ConsistTimeout (1 minutes) client 88 write-back & invalidate file <10,9574> "covet.Berkeley.EDU.log" Client state killed: 1 refs 1 write 0 exec FsrmtFileVerify: "covet.Berkeley.EDU.log" <10,9574> client 88 not found Fsrmt_RpcWrite, stale handle <10,9574> client 88 5/13/92 8:45:25 covet (88) initiating recovery Host 90 is subversion (Bob's machine). Jim rebooted covet, and the write-back error messages in covet's syslog reappeared. I rebooted allspice (more on that below), and covet kept complaining. I rebooted covet a second time, and it started complaining again. Finally I deleted the migd log file and restarted covet's migd. *That* made the messages stop. (By the way, the old and new migd log file have the same i-number, if that makes any difference.) Poking through Allspice's syslog, I also see (from around 1100 this morning) ProcessConsist: write-back & invalidate request failed <40008> file "maillog" <10,82837> <consist> RPC exit 0x1 Consistency failed 40008 on <10,82837> ConsistTimeout (1 minutes) client 90 write-back & invalidate file <10,82837> "maillog" Client state killed: 2 refs 2 write 0 exec if that's of any help. (2) Allspice had 4 "open" RPC's to lust that were hung. Lust reported them as busy. I wasn't sure how to track down what the RPCs were, so I rebooted Allspice. After Allspice rebooted Lust still reported those 4 RPC's as busy, so I rebooted Lust. \whine{Are we ever going to get kgcore working for DECstations?} (3) At some point during all this Jim checked his mail and found random bits of garbage and mail addressed to people other than him. Did anyone else get their mail file trashed? 
(4) Before I rebooted allspice, its syslog had a bunch of messages <18>May 13 10:45:09 sendmail[10e80]: AA69248: SYSERR: SMTP-MAIL: cannot fork: invalid argument <18>May 13 10:50:45 sendmail[50e3a]: NOQUEUE: SYSERR: daemon: cannot fork: invalid argument plus a bunch of messages about problems cleaning /swap1 ("skipping deleted file" and "Can't fetch handle"). I thought this might have been a replay of the problem where the old Compat_MapCode mapped VM_NO_SEGMENTS to EINVAL, but the sun4 sendmail binary is from March, so it should be using the current version of Compat_MapCode. mike Log-Number: 32416 Subject: lust problems from earlier today Date: Thu, 14 May 92 15:43:08 PDT From: Mike Kupfer <kupfer> Just before noon allspice, lust, and sassafras were stuck in some sort of three-way dance. Allspice kept saying that it was recovering handles with lust and that recovery failed because of an RPC timeout. Lust kept failing a reopen of /sprite/admin/migd/pdev with sassafras. It would then say it was waiting for recovery on /sprite/admin/migd/pdev, complain that it had a stale handle for "/", and then go through recovery with allspice. Sassafras wasn't talking. When I put it into the debugger, it mumbled something about no disk space for the migd global-log. When I went back to look at lust, it was still doing the same dance, but with a different partner (sabotage, instead of sassafras). When I looked at lust right after the Sprite meeting, it had a bunch of messages that looked like the /sprite/admin/migd/pdev dance (with larceny, I think), but the messages ended abruptly in mid-message. Lust didn't respond to the console or to RPC's, so I rebooted it. When it came up, it couldn't talk to ginger. "ping 128.32.150.28" (ginger's IP address) got no answers from ginger, and the inverse incantation on ginger didn't work, either. However, both ginger and lust could talk to allspice okay. 
I called Mary to figure out what the next step should be, and when I finished talking to her, the problem had gone away. Of course, by this time the nfsmount's were all messed up, so I power-cycled lust (so that it would run through its self-tests), and except for /pcs, it now seems to be fairly content. mike Log-Number: 32417 Subject: RCS directories in kernel sources Date: Thu, 14 May 92 16:23:38 PDT From: Mike Kupfer <kupfer> There aren't supposed to be RCS directories in the kernel module directories, are there? (I suspect that mkmf is creating them.) mike -- dbg/RCS/ dev/RCS/ fs/RCS/ fscache/RCS/ fsconsist/RCS/ fsdm/RCS/ fsio/RCS/ fslcl/RCS/ fspdev/RCS/ fsprefix/RCS/ fsrmt/RCS/ fsutil/RCS/ lfs/RCS/ libc/RCS/ mach/RCS/ main/RCS/ mem/RCS/ net/RCS/ ofs/RCS/ prof/RCS/ raid.null/RCS/ raid/RCS/ recov/RCS/ rpc/RCS/ sched/RCS/ sig/RCS/ sync/RCS/ sys/RCS/ timer/RCS/ utils/RCS/ vm/RCS/ Log-Number: 32419 Subject: ds3100 cpp is ANSI; pmake not lintable Date: Thu, 14 May 92 20:56:53 PDT From: Mike Kupfer <kupfer> /sprite/cmds.ds3100/cpp is the GNU cpp. Shouldn't it be a link to the Ultrix cpp (/usr/lib/cmplrs/cc/cpp), so that "cc -E" and "cpp" give you the same results? The reason I discovered this is that pmake is not lintable with an ANSI cpp. This is because pmake uses cpp token concatenation, the syntax for which depends on whether you have an ANSI or non-ANSI cpp. pmake is smart enough to recognize that ANSI cpp is different, but lint defeats pmake by turning off __STDC__. Theoretically you should still be able to lint pmake on DECstations, except that lint invokes cpp directly, rather than using "cc -E". mike Log-Number: 32420 Subject: lust crash: output packet too big Date: Thu, 14 May 92 23:34:14 PDT From: Mike Kupfer <kupfer> Lust croaked, complaining it had gotten a too-big output packet. It was running the 1.112 kernel; I rebooted with 1.113.
mike Log-Number: 32423 Date: Sun, 17 May 92 16:54:50 PDT From: shirriff (Ken Shirriff) Subject: dl477 woes The Dec laser printer in 477 repeatably hangs if you use "lprm" to remove a job from the print queue. The only way to get it working again is to reboot larceny, the host ds5000. I tracked the problem through the twisty maze of printer daemons, and the hang occurs in pscomm, where pscomm sends a ^T to the printer and then does a select to see if the printer has any status. When hung, the select fails. Checking in the kernel, the serial chip never generates an interrupt for an incoming character, causing the select to fail. I was unable to check if this problem is specific to ds5000s or occurs on Suns, due to a gender incompatibility in the sun4 serial compatibility. I was unable to check if this problem is specific to Dec laser printers because we don't have a working LaserWriter. So, until this gets fixed, if you change your mind about printing anything, don't use lprm. Ken Log-Number: 32433 Date: Fri, 22 May 92 13:32:05 PDT From: shirriff (Ken Shirriff) Subject: dl477 hanging problem fixed I fixed the problem with lprm hanging dl477. The problem was that lprm results in a TD_RAW_SHUTDOWN on the serial line, which seems to permanently shut down the line. I commented this out in devDC7085.c for the printer ports and now dl477 doesn't hang, and lprm still works. Ken Log-Number: 32424 Subject: ds5000 register botch if error when copying in args? Date: Sun, 17 May 92 22:52:32 PDT From: Mike Kupfer <kupfer> If I understand the ds5000 system call code, if there is an error calling a MachFetch?Args routine, the routine will appear to return with a non-zero status. If this happens, the system call code in machAsm.s bails out by jumping to sysCallReturn. The code at sysCallReturn checks the PCB's specialHandling flag by indirecting through s1. Unfortunately, s1 was only set up if there was no error from fetching the args.
If there was an error, it looks to me like the fetch indirects through garbage. mike Log-Number: 32429 Date: Wed, 20 May 92 16:32:20 PDT From: shirriff (Ken Shirriff) Subject: color printer problems I'm trying to print a large (500K) file on the color printer. After a while the printer says "RS232C ERROR" or "ENGINE CTRL ERR". The manual says: "The following error messages may appear on your printer display. Turn the power off and then on again. If the error message reappears, call your Digital service representative." I don't know if the messages mean our serial line messes up or some other problem. Ken Log-Number: 32432 Date: Thu, 21 May 92 13:11:39 PDT From: ouster (John Ousterhout) Subject: tyranny crash Tyranny also died this morning with the same "Fatal Error: FsCacheFileBlocks, bad block" error that other machines have been getting. What is this, an epidemic? -John- Log-Number: 32439 Date: Tue, 26 May 92 11:30:44 PDT From: pmchen (Peter M. Chen) Subject: mustard crash The message on the console was procPtr->vmPtr->numMakeAcc = 0 TLB LD miss exception at PC 0x800ab824 (Mustard was running 1.113, ds5000) Pete Log-Number: 32438 Date: Mon, 25 May 92 22:35:13 PDT From: shirriff (Ken Shirriff) Subject: Allspice, nameserver problems We seemed to run into the old "CSSG nameserver crashes and then things don't work" problem. At least, for some reason Sprite suddenly became unable to access most of the outside world: shallot, agate, okeeffe, ucbvax, etc. I'm assuming this problem will be cured tomorrow. Also, allspice later ended up in a deadlock so I rebooted. Other machines said: allspice (14) RPC timed-out Allspice was busy printing: "return-attrs msg to client xx file "spritehosts" failed." where xx was 60 (arson) and 1 (lust). Ken Log-Number: 32440 Subject: allspice reboot: /user6 filled up Date: Tue, 26 May 92 11:48:40 PDT From: Mike Kupfer <kupfer> /user6 filled up and crashed allspice. Bob tried rebooting it this morning and it kept crashing, so he gave up. 
When I came in, I tracked down the Responsible User and also one of the clients that was trying to do the writeback. I put the client (pepper) into the debugger and removed a directory that I had in /user6, freeing up 125KB or so. When I rebooted allspice, /user6 filled up again, so I gunned all the machines that the Responsible User was logged in on. I then cleaned out the {admin,cmds,daemons}.*.old directories, freeing up 25MB or so. Strangely enough, at one point the message Fscache_Write: Alloc failed <10,10> "(no name)" DISK FULL appeared on allspice's console. Partition 10 is the root, which had over 40MB free at the time. This was immediately followed by ClientCommand, return-attrs msg to client 44 file "spritehosts" <10,90598> failed 3000a mike Log-Number: 32441 Subject: allspice reboot: timed-out RPCs; routing bug? Date: Tue, 26 May 92 16:33:50 PDT From: Mike Kupfer <kupfer> Allspice started timing out RPCs from Sage. When I looked at Allspice's console, it was printing out about once a second ClientCommand, return-attrs msg to client 68 file "spritehosts" <...> failed 30004 Host 68 is sedition. Sedition's syslog showed that Allspice was timing out its RPCs, too. I did a "netroute -p" on Allspice; sedition was not listed in the route table. I tried reloading the route table with "netroute -f /etc/spritehosts", but that had no apparent effect. (By the time I could type "netroute -p" again the route to sedition was gone, assuming that it had gotten reinstalled in the first place.) I didn't want to try installing sedition's route by hand, so I rebooted. Allspice and Sedition were running the 1.113 kernel. mike Log-Number: 32444 Subject: more on RPC_INTERNAL_ERROR problems Date: Wed, 27 May 92 15:24:12 PDT From: Mike Kupfer <kupfer> There seem to be two problems here. First is that routes are getting lost and cannot be put back in. Second is that the consistency code keeps trying even in the face of lost routes.
I don't understand enough about how routing is supposed to work, but the first problem seems to be mismanagement of the route table. I looked at the route entries for forgery from the allspice.forgeryRoute core dump. The first element is $2 = {links = {prevPtr = 0xf61b30c8, nextPtr = 0xf69f9e08}, routeID = 2818048, protocol = 0, netAddress = { {type = 1, address = { ether = {byte1 = 8 '\b', byte2 = 0 '\000', byte3 = 43 '+', byte4 = 25 '\031', byte5 = 153 '\231', byte6 = 72 'H'}, ultra = {data = {"\b\000+\031\231H\353P"}}, fddi = {byte1 = 8 '\b', byte2 = 0 '\000', byte3 = 43 '+', byte4 = 25 '\031', byte5 = 153 '\231', byte6 = 72 'H'}, inet = 134228761}}, {type = 0, address = { ether = {byte1 = 0 '\000', byte2 = 0 '\000', byte3 = 0 '\000', byte4 = 2 '\002', byte5 = 255 '\377', byte6 = 255 '\377'}, ultra = {data = {"\000\000\000\002\377\377\377\377"}}, fddi = {byte1 = 0 '\000', byte2 = 0 '\000', byte3 = 0 '\000', byte4 = 2 '\002', byte5 = 255 '\377', byte6 = 255 '\377'}, inet = 2}}}, spriteID = 43, flags = 0, refCount = 1, desc = {"Route to forgery - ethernet, raw\000"...}, headerPtr = {0xf69f9da4 "\b", 0x83 <Address 0x83 out of bounds>}, interPtr = 0xf6105020, minPacket = 0, maxPacket = 1500, minRpc = 0, maxRpc = 17408, userData = 0x0, buffer = {...}} Note that the flags are 0, which means the route is not valid. 
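A minimal C sketch of how the flags field is being read here, assuming bit 0 is the valid bit (a guess based on the flags values 0 and 1 in the entries for forgery in this message). The names ROUTE_VALID, Route, and the two lookup helpers are hypothetical and are not the kernel's. The sketch only contrasts an index that counts every entry with an index that counts only valid entries.

```c
#include <assert.h>
#include <stddef.h>

#define ROUTE_VALID 0x1   /* assumed name and value for the valid bit */

/* Hypothetical, trimmed-down route entry: just the fields used below. */
typedef struct Route {
    int routeID;
    int flags;
} Route;

/* Index counts every entry; an invalid entry at that slot yields NULL. */
const Route *lookup_counting_all(const Route *routes, size_t n, size_t index)
{
    if (index >= n) {
        return NULL;
    }
    return (routes[index].flags & ROUTE_VALID) ? &routes[index] : NULL;
}

/* Index counts only valid entries; invalid ones are skipped. */
const Route *lookup_counting_valid(const Route *routes, size_t n, size_t index)
{
    size_t i;
    size_t seen = 0;

    for (i = 0; i < n; i++) {
        if (!(routes[i].flags & ROUTE_VALID)) {
            continue;           /* stale entry awaiting cleanup */
        }
        if (seen == index) {
            return &routes[i];
        }
        seen++;
    }
    return NULL;
}
```

With an invalid entry first in the list, asking the first helper for index 0 returns nothing, while the second helper returns the next valid entry.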
The next routing element for forgery is $7 = {links = {prevPtr = 0xf69f9d10, nextPtr = 0xf69fd258}, routeID = 2818049, protocol = 0, netAddress = { {type = 1, address = { ether = {byte1 = 8 '\b', byte2 = 0 '\000', byte3 = 43 '+', byte4 = 25 '\031', byte5 = 153 '\231', byte6 = 72 'H'}, ultra = {data = {"\b\000+\031\231H\000 "}}, fddi = {byte1 = 8 '\b', byte2 = 0 '\000', byte3 = 43 '+', byte4 = 25 '\031', byte5 = 153 '\231', byte6 = 72 'H'}, inet = 134228761}}, {type = 0, address = { ether = {byte1 = 255 '\377', byte2 = 255 '\377', byte3 = 255 '\377', byte4 = 255 '\377', byte5 = 0 '\000', byte6 = 0 '\000'}, ultra = {data = {"\377\377\377\377\000\000\000\000"}}, fddi = {byte1 = 255 '\377', byte2 = 255 '\377', byte3 = 255 '\377', byte4 = 255 '\377', byte5 = 0 '\000', byte6 = 0 '\000'}, inet = 4294967295}}}, spriteID = 43, flags = 1, refCount = 0, desc = {"Route to forgery - ethernet, raw\000\000"...}, headerPtr = {0xf69f9e9c "\b", 0x1 <Address 0x1 out of bounds>}, interPtr = 0xf6105020, minPacket = 0, maxPacket = 1500, minRpc = 0, maxRpc = 17408, userData = 0x0, buffer = {...}} Note that the valid bit is turned on in the flags. I can't tell if this route really is okay or not. Nonetheless, it won't get used, because RpcOutput's call to Net_IDToRoute specifies that only the first (index 0) route should be used. I'm not so sure that the second problem is really a bug, since theoretically you could use netroute to reload the routing table. If we do declare it to be a bug, then the guilty code is in ClientCommand: the "if" around Fsconsist_Kill should be changed to include RPC_INTERNAL_ERROR. mike Log-Number: 32445 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Wed, 27 May 1992 15:47:10 PDT Subject: Re: more on RPC_INTERNAL_ERROR problems Thanks for tracking this down. It looks to me like there are at least 4 separate bugs here. First, RpcOutput should call Net_IDToRoute with an index of -1, so that the RPC size is used to determine the route to use. 
Second, Net_IDToRoute does the wrong thing with the index. The index should refer to valid routes but currently it refers to all routes. Net_IDToRoute only returns a pointer to a valid route, and route 0 isn't valid so it doesn't return anything. If you are wondering why we need invalid routes at all, it is because a route could be deleted while it is in use. In that case the route is marked as invalid so that it will be ignored by subsequent calls to Net_IDToRoute. The invalid route will be deleted once its reference count drops to 0. This leads us to the third bug. For some reason the invalid route to forgery is not being cleaned up. Perhaps an RPC to forgery is hung? Finally, I think the upper-level code should declare the client dead if it can't find a route to it. Normally the Net_IDToRoute will do an ARP if it can't find a route to a host so this situation shouldn't happen very often. I'll fix the first two bugs if someone else (Mike?) will fix the last one. John Log-Number: 32448 Date: Fri, 29 May 92 07:32:05 PDT From: bmiller (Bob Miller) Subject: lust hung Lust was hung this morning... MachKernelException Handler: Address error on load: addr: 3 PC: 8004ed94 Entering debugger with a TLB load address exception at PC 0x8004ed94 I rebooted lust. Bob Log-Number: 32449 Date: Fri, 29 May 92 09:37:48 PDT From: voelker (Geoffrey M. Voelker) Subject: Re: lust hung Shoot, I thought for sure that this was in the net module, but 0x8004ed94 is in devSCSI90.c, line 441. I don't know how reliable that address is, but it's in the interrupt handler in the section that handles IR_ILL_CMD (illegal command). -geoff Log-Number: 32451 Subject: mysterious allspice hang Date: Mon, 01 Jun 92 16:18:11 PDT From: Mike Kupfer <kupfer> Allspice started hanging RPCs. There wasn't anything suspicious on the console, and L1-p didn't show anything interesting. I typed "uptime", which seemed to hang, so I took a core dump and rebooted.
The core dump is /home/ginger/cores/allspice.hung.1jun, and the kernel was 1.112. mike Log-Number: 32494 Subject: Re: mysterious allspice hang Date: Wed, 10 Jun 92 15:14:35 PDT From: Mike Kupfer <kupfer> I looked at the core dump from the June 1st allspice hang. It wasn't enlightening. There were a lot of processes stuck waiting for somebody to finish consistency on /etc/spritehosts, e.g., #0 0xf600c658 in Mach_ContextSwitch () #1 0xf60cf0bc in SyncEventWaitInt (...) (...) #2 0xf60ce2ac in Sync_SlowWait (...) (...) #3 0xf6049664 in StartConsistency ( consistPtr=(struct Fsconsist_Info *) 0xf65c8c40, clientID=14, useFlags=36865, cacheablePtr=(ClientData) 0xf76bd328) (fsconsistCache.c line 363) #4 0xf60495d4 in Fsconsist_FileConsistency ( handlePtr=(struct Fsio_FileIOHandle *) 0xf65c8c40, clientID=14, useFlags=36865, cacheablePtr=(ClientData) 0xf76bd328, openTimeStampPtr=(ClientData) 0xf76bd330) (fsconsistCache.c line 299) #5 0xf604df04 in Fsio_FileNameOpen (...) (...) #6 0xf6053150 in FslclOpen (...) (...) #7 0xf605c2a0 in Fsprefix_LookupOperation (...) (...) #8 0xf6030754 in Fs_Open (...) (...) #9 0xf603fc10 in Fs_OpenStub (...) (...) #10 0xf6011d1c in MachFetchArgsEnd () A lot of the processes were RPC servers. 
The consistency struct contained (kgdb) print *$5 $19 = {lock = {inUse = 0, waiting = 0, name = 0xf6049520 "Fs:consistLock", holderPC = 0xf60495b8 "@\002\021\310\220\020", holderPCBPtr = 0xf668cec0}, flags = 5, lastWriter = -1, openTimeStamp = 1819182, hdrPtr = 0xf65c8b80, clientList = {prevPtr = 0xf65c5eb0, nextPtr = 0xf6e38b90}, msgList = {prevPtr = 0xf6d76848, nextPtr = 0xf74a2508}, consistDone = {waiting = 1}, repliesIn = {waiting = 1}} (kgdb) print $5.hdrPtr.name $20 = (char *) 0xf65c5770 "spritehosts" (kgdb) print /x $5.hdrPtr.flags $21 = 0x00000001 (kgdb) print /x $5.hdrPtr.refCount $22 = 0x00000014 If I was able to find the right set of #defines for the flags, the consistency flags are FS_CONSIST_IN_PROGRESS|FS_CONSIST_TIMEOUT, and the handle header flag is FS_HANDLE_INSTALLED. mike Log-Number: 32452 Date: Mon, 1 Jun 92 18:05:10 PDT From: shirriff (Ken Shirriff) Subject: Out of control csh -i on lust An out of control csh -i on lust was using 90% of the cpu. I found another one on arson. I've tried to debug them, but without success. Ken Log-Number: 32453 Subject: recovery deadlock, writeback failure Date: Mon, 01 Jun 92 22:41:54 PDT From: Mike Kupfer <kupfer> I ran into a case where allspice started cleaning then gave up on a writeback request, while the client it gave up on (sage) was still alive. The client then got stuck waiting to do recovery with allspice. Here's more or less what sage's syslog had: RpcDoCall: <write> RPC to allspice is hung RpcDoCall: <write> RPC to allspice is hung <write> RPC ok <write> RPC ok 6/1/92 21:20:52 allspice RmtFile "sun4.md/fsAttributes.o" <...> Writeback failed: stale handle ... 
allspice (14) recovering handles Fsprefix_OpenCheck waiting for recovery Fsprefix_OpenCheck waiting for recovery Fsprefix_OpenCheck waiting for recovery Here's the relevant section from allspice's syslog: /sprite/src/kernel: Cleaning started - deficit 60 segs ConsistTimeout (1 minutes) client 33 write-back file <2,53970> "fsAttributes.o" Client state killed: 0 refs 0 write 0 exec /sprite/src/kernel: Cleaned 86 segments in 32 segments /sprite/src/kernel: Cleaning started - deficit 13 segs FsrmtFileVerify: "fsAttributes.o" <2,53970> client 33 not found Fsrmt_RpcWrite, stale handle <2,53970> client 33 Fscache_BlockRead: Giving zeros to "fsAttributes.o" <2,53970> block 1 amount 43657 /sprite/src/kernel: Cleaned 116 segments in 50 segments Now, the reason why sage couldn't recover with allspice seems to be that it had deadlocked. Here's the process in Fsutil_Reopen: (gdb) bt #0 0xf600c5c0 in Mach_ContextSwitch () #1 0xf60bb914 in SyncEventWaitInt (event=4133358952, wakeIfSignal=0) (syncLock.c line 634) #2 0xf60ba79c in Sync_SlowLock (lockPtr=(struct Sync_KernelLock *) 0xf65e0d68) (syncLock.c line 214) #3 0xf60ba558 in Sync_GetLock (lockPtr=(struct Sync_KernelLock *) 0xf65e0d68) (syncLock.c line 129) #4 0xf6040590 in Fscache_OkToScavenge (cacheInfoPtr= (struct Fscache_FileInfo *) 0xf65e0d48) (fscacheOps.c line 420) #5 0xf605bcac in FsrmtFileReopen (hdrPtr= (struct Fs_HandleHeader *) 0xf65e0cf0) (fsrmtFile.c line 276) #6 0xf6061780 in ReopenHandles (serverID=14) (fsutilRecovery.c line 219) #7 0xf6061540 in Fsutil_Reopen (serverID=14) (fsutilRecovery.c line 125) #8 0xf6061d04 in Fsutil_AttemptRecovery (data=(ClientData) 0xf65e0cf0, callInfoPtr=(Proc_CallInfo *) 0xf802fdd8) (fsutilRecovery.c line 522) #9 0xf60a3ec4 in Proc_ServerProc () (procServer.c line 380) #10 0xf60b7330 in Sched_StartKernProc (...) (...) It's waiting for the monitor lock on the cache info object. The process that holds the lock is doing a writeback on that cache block. 
#0 0xf600c5c0 in Mach_ContextSwitch () #1 0xf60bb914 in SyncEventWaitInt (event=4133358996, wakeIfSignal=0) (syncLock.c line 634) #2 0xf60bab04 in Sync_SlowWait (conditionPtr= (struct Sync_Condition *)0xf65e0d94, lockPtr=(struct Sync_KernelLock *) 0xf60e0880, wakeIfSignal=0) (syncLock.c line 279) #3 0xf603d434 in Fscache_FileWriteBack (cacheInfoPtr= (struct Fscache_FileInfo *) 0xf65e0d48, firstBlock=-161581352, lastBlock=11, flags=1, blocksSkippedPtr=(ClientData) 0xf801dc9c) (fscacheBlocks.c line 1391) #4 0xf60406c8 in Fscache_Consist (cacheInfoPtr= (struct Fscache_FileInfo *) 0xf65e0d48, flags=1, cachedAttrPtr=(struct Fscache_Attributes *) 0xf801dd20) (fscacheOps.c line 469) #5 0xf6043ea0 in ProcessConsist (data=(ClientData) 0xf65b64c8, callInfoPtr=(Proc_CallInfo *) 0xf801ddd8) (fsconsistCache.c line 1952) #6 0xf60a3ec4 in Proc_ServerProc () (procServer.c line 380) #7 0xf60b7330 in Sched_StartKernProc (...) (...) I think the reason this process is stuck has something to do with the fact that Fscache_GetDirtyBlock returns a NIL pointer if the server is down (see FsrmtCleanBlocks). (gdb) print /x cacheInfoPtr.flags $15 = 0x00004082 mike Log-Number: 32454 Subject: no feedback if rmt process killed? Date: Mon, 01 Jun 92 22:56:05 PDT From: Mike Kupfer <kupfer> I just ran into a problem where I'd try to make sun4.md/fs.o. According to pmake, everything went fine. The only problem is, sun4.md/fs.o was never created. Finally I did "pmake -X" and found that ld was going into the debugger. (I assume this is because one or more .o files got trashed because of the problems I had earlier this evening with allspice.) Sprite carefully destroys migrated processes instead of letting them go into the debugger, but I would have expected pmake to get some sort of notification. mike Log-Number: 32455 Subject: procDebug deadlock Date: Mon, 01 Jun 92 23:29:32 PDT From: Mike Kupfer <kupfer> Terrorism got stuck earlier this evening. A bunch of processes were trying to lock process 43e19. 
Process 43e19 was stuck in AddToDebugList waiting for the procDebug.c monitor lock (it had a SIGDEBUG). The process holding the procDebug monitor lock was gdb. It was trying to lock process 23e2e, which was itself locked and, like 43e19, had a SIGDEBUG and was trying to get the procDebug monitor lock via AddToDebugList. So the bottom line is that there is a deadlock between Proc_SuspendProcess/AddToDebugList, which locks the PCB and then gets the procDebug monitor lock, and ProcGetThisDebug, which gets the procDebug monitor lock and then locks the PCB. mike Log-Number: 32457 Subject: problems with rlogin for weekly dumps and log out? Date: Tue, 02 Jun 92 13:36:07 PDT From: Mike Kupfer <kupfer> Mary ran into problems trying to run the weekly dumps from murder. She'd rlogin to sassafras, start the dumps, then log out. The dumps would continue... until the current filesystem was dumped, then the dumps would stop. This would happen even with stdout (and stderr?) redirected. The last few messages in the redirected output would be dump exiting, there were 0 non-fatal errors, 0 hard errors csh: ioctl(fd, FIONCLEX, NULL) failed: I/O error csh: ioctl(fd, FIONCLEX, NULL) failed: I/O error csh: ioctl(fd, FIONCLEX, NULL) failed: I/O error csh: ioctl(fd, FIONCLEX, NULL) failed: I/O error csh: ioctl(fd, FIONCLEX, NULL) failed: I/O error Note that the message is coming from csh, not from tar, so I don't think this is a tape error. Also, when I ran the weekly dumps directly from sassafras's console, these messages didn't appear. Does anyone know what's going on here? Should I edit the dump how-to to say that once you start the weekly dumps, you have to leave that shell around until they complete? mike Log-Number: 32462 Date: Wed, 3 Jun 92 09:18:57 PDT From: pmchen (Peter M. Chen) Subject: Re: problems with rlogin for weekly dumps and log out? I ran into a similar problem once using "sync" while remotely logged in. 
As I recall, I would rlogin, source a script that generated a sync, then logout. The way I fixed it was to put the script that I sourced into a csh script. Try putting the dump script into a csh script that calls the dump script. Pete Log-Number: 32459 Subject: lust crash: packet too big Date: Tue, 02 Jun 92 18:40:08 PDT From: Mike Kupfer <kupfer> Fatal Error: OutputPacket: packet too large (4174) (entering debugger at PC) 0x800ee71c I tried debugging it from dill, but dill couldn't connect with lust. Dumping system log ... Timing out and resending to host lust Timing out and resending to host lust Timing out and resending to host lust By the way, shouldn't there be a printf in panic() that gives the version of the kernel, so that you know which symbols to use? There's no kmsg on dill, so you can't do "kmsg -v". mike Log-Number: 32460 Subject: cpp version confusion Date: Tue, 02 Jun 92 22:27:19 PDT From: Mike Kupfer <kupfer> It occurred to me that if I fix up /sprite/cmds.ds3100/cpp to be a link to the Ultrix cpp, I need to make sure it doesn't get overwritten by a "make install" in the GNU cpp source directory. So, I trundled over to /sprite/src/cmds/cpp and discovered that the version installed there (1.37.1) is not the version we're all running (1.36). Anyone know what the scoop is? Is 1.37.1 safe to use? mike Log-Number: 32463 Date: Wed, 3 Jun 92 12:46:51 PDT From: shirriff@ginger.CS.Berkeley.EDU (Ken Shirriff) Subject: Allspice: DISK FULL, crash Allspice's disk filled up and it crashed. It crashed in a new place: CreateFile: aborting create of 163303 (syms.texi) in 163279 Log-Number: 32467 Date: Wed, 3 Jun 92 23:15:36 PDT From: shirriff@ginger.CS.Berkeley.EDU (Ken Shirriff) Subject: Allspice crash: netroute Allspice ran into the same problem with: ClientCommand: returnAttrs to client 1 failed "spritehosts" 30004 I rebooted. 
Log-Number: 32468
Date: Thu, 4 Jun 92 07:44:31 PDT
From: bmiller (Bob Miller)
Subject: allspice reboot

allspice wasn't responding this morning. My syslog window shows allspice and sage hung. Allspice's console showed 'Spurious interrupt' and 'Reinit recv unit' messages. I rebooted with 'new'.

Bob

Log-Number: 32470
Subject: 608-2 printer
Date: Thu, 04 Jun 92 13:53:35 PDT
From: Mike Kupfer <kupfer>

I have a simple test case that fails to print the 3rd page (out of 3 pages) on the 608-2 Laserwriter. I hooked up the 608-8 printer to Sage and the test case works fine. So, I think we can conclude that the fault is with the printer, rather than with the cabling or with Sage.

So, didn't we decide at the Sprite meeting last week to give up on repair attempts and just get a new printer? Maybe we can bury the old one under Soda Hall in a time capsule...

mike

Log-Number: 32471
From: mgbaker (Mary Baker)
Subject: fs switch on -1 type again
Date: Thu, 04 Jun 92 15:25:55 PDT

We haven't seen one of these in a long time, but my machine just went into the debugger for a switch through an fs table using a type of -1. Here's the details:

(kgdb) where
#0 panic (__builtin_va_alist=-166931079) (sysPrintf.c line 227)
sysPrintf.c: no such file or directory.
#1 0xf60cd92c in MachPageFault (busErrorReg=128, addrErrorReg=(char *) 0x8 <Address 0x8 out of bounds>, trapPsr=289407175, pcValue=(char *) 0x0) (sun4c.md/machCode.c line 1389)
#2 0xf60d1d04 in MachHandlePageFault ()
#3 0xf60e9748 in Fs_Open (name=(char *) 0xf81bba58 "sun4c.md/fsutil.h", useFlags=36865, type=0, permissions=-159761008, streamPtrPtr=(struct Fs_Stream **) 0xf81bba4c) (fsNameOps.c line 143)
#4 0xf60f8bf0 in Fs_OpenStub (...) (...)
#5 0xf60d1a7c in MachFetchArgsEnd ()
Reading in symbols for fsNameOps.c...list done.
#3 0xf60e9748 in Fs_Open (name=(char *) 0xf81bba58 "sun4c.md/fsutil.h", useFlags=36865, type=0, permissions=-159761008, streamPtrPtr=(struct Fs_Stream **) 0xf81bba4c) (fsNameOps.c line 143) 143 openResults.streamData, name, &streamPtr->ioHandlePtr); (kgdb) 138 useFlags, name, (Boolean *)NIL, (Boolean *)NIL); 139 streamPtr->nameInfoPtr = nameInfoPtr; 140 Fsutil_HandleUnlock(streamPtr); 141 status = (*fsio_StreamOpTable[openResults.ioFileID.type].ioOpen) 142 (&openResults.ioFileID, &streamPtr->flags, rpc_SpriteID, 143 openResults.streamData, name, &streamPtr->ioHandlePtr); 144 if (status == SUCCESS) { 145 if (streamPtr->flags & FS_TRUNC) { 146 (void)Fs_TruncStream(streamPtr, 0); 147 } (kgdb) p openResults $2 = { ioFileID = { type = -1, serverID = 0, major = 73, minor = 134217728 }, streamID = { type = 2, serverID = 14, major = 14, minor = 169211 }, nameID = { type = 2, serverID = 14, major = 2, minor = 87878 }, dataSize = 52, streamData = 0xf6834bd0 } Mary Log-Number: 32475 Subject: can't shut down 1.114 cleanly on ds5000 Date: Fri, 05 Jun 92 12:26:27 PDT From: Mike Kupfer <kupfer> I frequently have problems (random exceptions) shutting down kernels that contain the FDDI support. Geoff told me that this seems to be related to using dynamically allocated buffers for FDDI. If the buffers are statically allocated, the exceptions don't happen. mike Log-Number: 32479 Date: Mon, 8 Jun 92 11:22:10 -0700 From: <voelker@almaden.ibm.com> (Geoff Voelker) Subject: problems with shutdown & FDDI on dec5000s Mike is correct, there was a problem with shutdown messing up when the FDDI module dynamically allocated its receive ring buffers. If this problem persists and a quick fix is desired, the FDDI module can be switched to use statically allocated buffers (with which I've never encountered the shutdown problem). Just #define NET_DF_USE_UNCACHED_MEM in kernel/net/ds5000.md/netDFInt.h. 
(The FDDI driver used uncached memory with static buffers from way back when I was building the driver, and then I switched to dynamically allocated buffers + cache flushing once the driver started working. Hence the USE_UNCACHED_MEM == static buffers. It was this switch that gave the FDDI driver the performance boost.) Switching to static buffers will increase the kernel size by about 45k. -geoff p.s. Mike wisely suggested that I list the known bugs before I left, and I unwisely forgot to do so. In the driver, this was the only known bug that I had not fixed (which probably forebodes bunches and bunches that I don't know about :) Log-Number: 32480 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Mon, 8 Jun 1992 11:43:42 PDT Subject: Re: problems with shutdown & FDDI on dec5000s The ds5000s crash on shutdown in the prom somewhere. The reserved instruction exception happens after the kernel calls the prom to shutdown the machine. I'm looking into the problem, which I believe might have something to do with the stack. In the meantime if you shut down a ds5000 you'll have to push the reset switch on the back. John Log-Number: 32477 Subject: misc. man page glitches Date: Fri, 05 Jun 92 12:51:39 PDT From: Mike Kupfer <kupfer> Something for the Spring Cleaning list... I had occasion to manually reindex the man pages. There were a few man pages, listed below, that caused reindex to complain. Also, reindex bailed out when it got to the SWW man pages, complaining index: invalid argument I assume we're still able to read the indexing information on the SWW, so I guess it's not a major problem. On the other hand, any man directories that appear after the SWW ones will apparently not get indexed. mike -- /sprite/man/cmds Couldn't find "NAME" section in "cb.man". Couldn't find "NAME" section in "cc_mips.man". Couldn't find "NAME" section in "cvs.man". Unexpected end-of-file in KEYWORDS section of "lockdir.man". Couldn't find "NAME" section in "trchange.man". 
/local/man/cmds
Couldn't find "NAME" section in "checkin.man".
Couldn't find "NAME" section in "mkmodules.man".
Couldn't find "NAME" section in "sup.man".

Log-Number: 32488
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 9 Jun 1992 17:24:15 PDT
Subject: mx/tx bug fixed (sort of)

I've boosted the size of the selection that can be sent by mx and tx to 256K. This should be big enough to handle most things. If you go larger than that you'll get the "nothing selected" message. Of course this makes the processes use more memory, but my changes to mx and tx are supposed to be temporary anyway.

John

Log-Number: 32491
Subject: ds3100 "at" lost its setuid bit again
Date: Wed, 10 Jun 92 12:02:35 PDT
From: Mike Kupfer <kupfer>

... this also happened earlier this year, with no indication of what had happened. Why is it that the ds3100 "at" binaries (at, atq, atrm) lose their setuid bit, but none of the binaries for the other architectures do?

mike

Log-Number: 32495
Date: Wed, 10 Jun 92 18:20:16 PDT
From: mottsmth (Jim Mott-Smith)
Subject: Allspice died with lfsSetSegUsage bad segment number

Allspice died with lfsSetSegUsage bad segment number 937354
Core is in /home/ginger/cores/allspice.lfsSetSegUsage

-- Jim M-S

Log-Number: 32497
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Fri, 12 Jun 1992 11:59:32 PDT
Subject: sendmail is acting up again

Sendmail has been misbehaving again this morning. It keeps getting "NOQUEUE: SYSERR: getrequests: accept: invalid argument" messages. Someone has put in code to reopen the socket if the accept fails, but that doesn't seem to fix the problem. Restarting sendmail seems to do the trick (at least for a short time). We should get serious about fixing this bug.

John

Log-Number: 32498
Date: Fri, 12 Jun 92 14:38:45 PDT
From: shirriff@ginger.CS.Berkeley.EDU (Ken Shirriff)

Allspice crashed with LfsSetSegUsage bad segment number 1776985. I think this is an old problem, but I took a core dump anyways in case it is not.
Log-Number: 32501 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Fri, 12 Jun 1992 16:18:39 PDT Subject: Proc_StringNCopy The routine Proc_StringNCopy() has an argument "numBytes" whose comment says /* Maximum number of bytes to copy. */. This isn't strictly true, however, as a null character is always appended to the output string even if you stopped copying characters because numBytes was exceeded. This means you may get the null written beyond the end of the buffer. Perhaps numBytes should be the maximum number of non-null bytes copied, but even so some of the routines in the fs module are using it incorrectly. John Log-Number: 32502 Subject: lust crash: /user5 filled up Date: Fri, 12 Jun 92 17:48:07 PDT From: Mike Kupfer <kupfer> I rebooted with the 1.114 kernel. mike Log-Number: 32503 Date: Sat, 13 Jun 92 10:46:34 PDT From: mottsmth (Jim Mott-Smith) Subject: Lust died with TLB load address error Lust was dead when I came in. The console said: MachKernelExceptionHandler: Address error on load: addr: 3 PC 8004ee44 TLB load address error exception at 8004ee44 I rebooted with 'new'. -- Jim M-S Log-Number: 32504 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Sat, 13 Jun 1992 21:09:44 PDT Subject: lust crash (followup) PC 0x8004ee44 is in DevSCSIC90Intr: (kgdb) l *0x8004ee44 0x8004ee44 is in DevSCSIC90Intr (ds5000.md/devSCSIC90.c, line 441). 436 if (ctrlPtr->interruptDevPtr != (Device *)NIL) { 437 MASTER_UNLOCK(&(ctrlPtr->mutex)); 438 return TRUE; 439 } else { 440 printf("%s: illegal command.\n", 441 devPtr->handle.locationName); 442 status = FAILURE; 443 } 444 } 445 if (interruptReg & IR_SLCT_ATN) { If this happens again please debug it so we can figure out what happened. John Log-Number: 32505 Date: Mon, 15 Jun 92 07:52:43 PDT From: bmiller (Bob Miller) Subject: reboot Allspice was hung this morning. I rebooted. The console was showing: Proc_Exec: Can't run sun3 ZMAGIC executable file on sun4. 
No stream <363903> for client 1 Fsrmt_RpcRead no stream <363903> to handle <0,47084> client 1 Log-Number: 32514 From: jhh@sprite.Berkeley.EDU (John H. Hartman) Date: Tue, 16 Jun 1992 21:52:35 PDT Subject: /usr/sww/bin/xwaisq crashes ds5000 The stack appears to be messed up: (kgdb) where #0 UNIXSyscall () (ds5000.md/machAsm.s line 2285) #1 0x80033ad8 in UNIXSyscall () (ds5000.md/machAsm.s line 2123) (kgdb) l * $pc 0x80033ae0 is in UNIXSyscall (ds5000.md/machAsm.s, line 2285). 2280 /* 2281 * If memory interrupts aren't turned on then we can't do a 2282 * probe. 2283 */ 2284 mfc0 t0, MACH_COP_0_STATUS_REG 2285 nop 2286 and t0, t0, MACH_INT_MASK_3 | MACH_SR_INT_ENA_CUR 2287 beq t0, MACH_INT_MASK_3 | MACH_SR_INT_ENA_CUR, 1f 2288 nop 2289 j ra The pc is actually in Mach_Probe. Something is really going off the deep end. This happened running ds5000.1.114. John Log-Number: 32515 Date: Wed, 17 Jun 92 12:30:35 PDT From: mottsmth (Jim Mott-Smith) Subject: Can't run xv under compatibility Trying to run /usr/sww/X11/bin/xv on Sabotage says: ld.so: text write-enable error (22) for main_$main_ Running it on Covet seems to hang the process. -- Jim M-S Log-Number: 32518 Date: Thu, 18 Jun 92 15:32:47 PDT From: sullivan (Mark Sullivan) Subject: make path problems My makefile contains the following target: ------------------------------ clean: /bin/rm *.o ln -s vmStubs.back vmStubs.o /bin/rm testgram.c ------------------------------ If I execute "make" on that makefile, I get: babylon<1> make clean /bin/rm *.o ln -s vmStubs.back vmStubs.o ln: not found *** Error code 1 "pmake" doesn't seem to have any problems finding ln and make works fine if I replace ln with /bin/ln. Sounds like make has a path problem. Mark Log-Number: 32519 Date: Thu, 18 Jun 92 15:35:41 PDT From: sullivan (Mark Sullivan) Subject: amended bug report Problem only occurs on ds3100. make works fine on the ds5000. Mark Log-Number: 32522 From: jhh@sprite.Berkeley.EDU (John H. 
Hartman) Date: Fri, 19 Jun 1992 17:57:55 PDT Subject: lfs bug? Pete Chen has been having some problems with LFS deadlocking, and I think I've tracked down the problem but I want to double-check with Mendel to be sure. There is a loop at the top of PlaceFileInSegment() in the LFS module in which all the dirty blocks for a file are obtained from Fscache_GetDirtyBlock(). One of the things that Fscache_GetDirtyBlock() does is set the FSCACHE_BLOCK_BEING_WRITTEN flag for the block. Later in PlaceFileInSegment() the dirty blocks for the file are processed in the order: doubly-indirect, indirect, direct. This loop places the blocks into the segment, and while doing so has to update the index for the block. It does this by calling LfsFile_SetIndex(), which updates the index for the block, whether it be in a descriptor or an indirect block. The cacheFlags parameter passed to LfsFile_SetIndex() is FSCACHE_CANT_BLOCK, whose meaning is pretty much undocumented, but which shouldn't be confused with FSCACHE_DONT_BLOCK. Here is how the deadlock happens. One of the blocks to be written is pointed to by an indirect block that is also dirty. When LfsFile_SetIndex() is called for this block it eventually calls Fscache_FetchBlock() for the parent block, but Fscache_FetchBlock() blocks because the parent block is marked as FSCACHE_BLOCK_BEING_WRITTEN, and the FSCACHE_DONT_BLOCK flag is not set. Then lfs comes to a grinding halt. So, I think the correct solution is to pass FSCACHE_DONT_BLOCK as (one of) the flags to LfsFile_SetIndex(), but to be honest I don't fully understand the difference between these two flags and I don't really want to mix-n-match. Actually, it appears that if the FSCACHE_DONT_BLOCK were set then Fscache_FetchBlock() would return a NIL block pointer, causing LfsFile_SetIndex() to return FS_WOULD_BLOCK, causing PlaceFileInSegment() to panic. So, there doesn't appear to be an easy way out of this. 
I'm hoping Mendel will understand all of this and will suggest a possible solution (or point out my mistake in understanding the code).

John

Log-Number: 32523
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Fri, 19 Jun 1992 18:21:28 PDT
Subject: lfs bug followup

I talked to Peter Chen and his application does random overwrite and read of an existing file (no appends). I was curious as to why the indirect block was marked as dirty anyway. It seems to me that the only way an indirect block could be dirty is if you added a new block to a file, which Pete isn't doing. My guess now is that the indirect block was dirtied during a previous segment write when one of its children was written. AppendBlock() just updates the pointer in the indirect block, then returns the block to the cache via Fscache_UnlockBlock(). It doesn't appear that it will be written out in the same segment. This is kind of interesting because it means that a randomly written block will not be in the same segment as its indirect block (I don't know if the same is true of the descriptor). Anyway, if another child block gets written before the indirect block is written then the deadlock will occur. At least that's what I think anyway.

John

Log-Number: 32528
Subject: Re: lfs bug? (
Date: Mon, 22 Jun 92 13:31:55 -0700
From: mendel@lagunita.stanford.edu

>
> Pete Chen has been having some problems with LFS deadlocking, and
> I think I've tracked down the problem but I want to double-check
> with Mendel to be sure. There is a loop at the top of PlaceFileInSegment()
> in the LFS module in which all the dirty blocks for a file are
> obtained from Fscache_GetDirtyBlock(). One of the things that
> Fscache_GetDirtyBlock() does is set the FSCACHE_BLOCK_BEING_WRITTEN
> flag for the block. Later in PlaceFileInSegment() the dirty blocks
> for the file are processed in the order: doubly-indirect, indirect,
> direct.
This loop places the blocks into the segment, and while > doing so has to update the index for the block. It does this by > calling LfsFile_SetIndex(), which updates the index for the block, > whether it be in a descriptor or an indirect block. The cacheFlags > parameter passed to LfsFile_SetIndex() is FSCACHE_CANT_BLOCK, whose > meaning is pretty much undocumented, but which shouldn't be confused > with FSCACHE_DONT_BLOCK. FSCACHE_DONT_BLOCK means don't block if the requested block is busy. FSCACHE_CANT_BLOCK means the requesting process can not block for any reason. The difference between the flags is the CANT_BLOCK flag will work even if the cache is "full" of dirty blocks. It will never block because the cache contains no clean blocks. Basically the flag informs the cache code that it should dip into its list of reserved blocks if necessary. > Here is how the deadlock happens. One of > the blocks to be written is pointed to by an indirect block that > is also dirty. When LfsFile_SetIndex() is called for this block it > eventually calls Fscache_FetchBlock() for the parent block, but > Fscache_FetchBlock() blocks because the parent block is marked as > FSCACHE_BLOCK_BEING_WRITTEN, and the FSCACHE_DONT_BLOCK flag is > not set. Then lfs comes to a grinding halt. So, I think the correct > solution is to pass FSCACHE_DONT_BLOCK as (one of) the flags to > LfsFile_SetIndex(), but to be honest I don't fully understand the > difference between these two flags and I don't really want to > mix-n-match. Yuck. > > Actually, it appears that if the FSCACHE_DONT_BLOCK were set then > Fscache_FetchBlock() would return a NIL block pointer, causing > LfsFile_SetIndex() to return FS_WOULD_BLOCK, causing PlaceFileInSegment() > to panic. So, there doesn't appear to be an easy way out of this. > I'm hoping Mendel will understand all of this and will suggest a > possible solution (or point out my mistake in understanding the > code). You are right. 
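[Editorial sketch, not part of the original thread.] The distinction between the two flags, as described in the replies above, can be modelled with a small toy function. Everything here is hypothetical: the `Toy*` names, the flag values, and the state structure are invented for illustration and are not the Sprite definitions. The case of an empty cache with neither reserve access nor a busy block is an assumption.

```c
#include <assert.h>

/* Hypothetical flag values; the real Sprite values are not shown in the thread. */
enum { TOY_DONT_BLOCK = 0x1, TOY_CANT_BLOCK = 0x2 };

typedef enum {
    TOY_GOT_BLOCK,           /* ordinary clean block handed out */
    TOY_GOT_RESERVED_BLOCK,  /* cache dipped into its reserved blocks */
    TOY_RETURNED_NIL,        /* caller gets NIL and must cope (FS_WOULD_BLOCK) */
    TOY_SLEEPS               /* caller is put to sleep */
} ToyOutcome;

typedef struct {
    int beingWritten;    /* models FSCACHE_BLOCK_BEING_WRITTEN on the target */
    int cleanBlocks;     /* clean blocks currently available */
    int reservedBlocks;  /* blocks held back for callers that cannot block */
} ToyCacheState;

/* Toy model of what a fetch does under each flag combination. */
ToyOutcome ToyFetchBlock(const ToyCacheState *s, int flags)
{
    assert(s != 0);
    if (s->beingWritten) {
        /* A busy block: only DONT_BLOCK avoids the wait; CANT_BLOCK does not. */
        return (flags & TOY_DONT_BLOCK) ? TOY_RETURNED_NIL : TOY_SLEEPS;
    }
    if (s->cleanBlocks > 0) {
        return TOY_GOT_BLOCK;
    }
    if ((flags & TOY_CANT_BLOCK) && s->reservedBlocks > 0) {
        return TOY_GOT_RESERVED_BLOCK;
    }
    return TOY_SLEEPS;  /* assumption: otherwise wait for a clean block */
}
```

In this model, a caller passing only TOY_CANT_BLOCK still sleeps on a block that is being written, which is the situation reported for LfsFile_SetIndex(), while TOY_DONT_BLOCK turns the same case into a NIL return.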
> I talked to Peter Chen and his application does random overwrite
> and read of an existing file (no appends). I was curious as to why
> the indirect block was marked as dirty anyway. It seems to me that
> the only way an indirect block could be dirty is if you added a
> new block to a file, which Pete isn't doing.

The file index (i.e. inode and indirect blocks) is modified every time a block is written in an LFS. See below.

> My guess now is that
> the indirect block was dirtied during a previous segment write when
> one of its children was written. AppendBlock() just updates the
> pointer in the indirect block, then returns the block to the cache
> via Fscache_UnlockBlock(). It doesn't appear that it will be written
> out in the same segment. This is kind of interesting because it
> means that a randomly written block will not be in the same segment
> as its indirect block (I don't know if the same is true of the
> descriptor). Anyway, if another child block gets written before
> the indirect block is written then the deadlock will occur. At
> least that's what I think anyway.

Here is what I think is happening:

The code in PlaceFileInSegment() treats a file as a tree with the inode being the root and the data blocks being the leaves. It looks something like:

                            Inode
         _____________________|_____________________
        /                     |                     \
        |                    I-1                    I-2
        |                     |                ______|______
        |                     |               /             \
    ____|____            _____|_____        I-3             I-4  ....
   /    |    \          /     |     \      / | \           / | \
  D0   D1 ... D9      D10   D11 ... D1033  D1034 D1035 ... D2058  D2059 ...

Where Dn are data blocks and I-n are indirect blocks. The code works by placing the tree in the segment one level at a time starting with the leaves (Dn). Placing the data blocks causes the first level indirect blocks (I-3, I-4, etc) to be modified. These blocks are placed next. Next the I-1 and I-2 blocks are placed in the segment. Finally, once all the indirect and data blocks are placed in the segment the inode is placed in the segment.
The principle of placing the tree bottom up means that all the modifications to an indirect block should be made before the indirect block is placed in the segment. Therefore the deadlock you found should not occur. The deadlock occurs because PlaceFileInSegment() may be called multiple times for the same file. This happens when the current segment summary block fills and a new one needs to be allocated. What happens is that PlaceFileInSegment() gets a file and starts placing it in a segment. It places all the dirty data blocks and some of the indirect blocks. At this point it detects that the segment summary block is filled so it can't add the rest of the indirect blocks. PlaceFileInSegment() then returns TRUE saying it has more data to place in a segment. The segment layout code can then add a new segment summary block and call PlaceFileInSegment() again. PlaceFileInSegment() starts with the data blocks (there should be none dirty) and then places the rest of the indirect blocks and the inode. The deadlock occurs on Peter's test case because during the time the indirect blocks and new segment summary block are being added to the segment, the program modifies some data blocks in the file that happen to be mapped by indirect blocks that have already been placed in the segment. When PlaceFileInSegment() is called for the second time on the file it finds these data blocks and tries to place them in the segment. The deadlock occurs during the update to the indirect block which is already placed in the segment. For example, assume the program modified D1034. This would cause D1034, I-3, I-2, and the inode to be placed in the segment. Assume that there is no room for I-2 so a new segment summary block is allocated. At the same time the program modifies D1035. Now when PlaceFileInSegment() is called it will place D1035 and deadlock updating I-3. I'll see if I can come up with a fix for this problem. 
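[Editorial sketch, not part of the original thread.] The two-call scenario described above (D1034 placed, summary block fills before I-2, D1035 dirtied in between) can be replayed with a small toy model. All `Toy*` names are hypothetical, the model only tracks four blocks, and it marks a block as "being written" at placement time, which is a simplification of what the real Fscache_GetDirtyBlock() is described as doing.

```c
#include <assert.h>

enum { TOY_D1034, TOY_D1035, TOY_I3, TOY_I2, TOY_NBLOCKS };

/* Parent of each block in the index tree; -1 stands for the inode. */
static const int toyParent[TOY_NBLOCKS] = { TOY_I3, TOY_I3, TOY_I2, -1 };
/* Tree level: 0 = data, 1 = first-level indirect, 2 = upper indirect. */
static const int toyLevel[TOY_NBLOCKS] = { 0, 0, 1, 2 };

typedef struct {
    int dirty[TOY_NBLOCKS];
    int beingWritten[TOY_NBLOCKS];  /* models FSCACHE_BLOCK_BEING_WRITTEN */
} ToyFile;

typedef enum { TOY_DONE, TOY_MORE, TOY_WOULD_DEADLOCK } ToyStatus;

/*
 * Place dirty blocks leaves-first. 'room' is how many more blocks the
 * current segment summary block can describe.
 */
ToyStatus ToyPlaceFileInSegment(ToyFile *f, int room)
{
    int level, b;

    assert(f != 0);
    for (level = 0; level <= 2; level++) {
        for (b = 0; b < TOY_NBLOCKS; b++) {
            int p = toyParent[b];
            if (toyLevel[b] != level || !f->dirty[b]) {
                continue;
            }
            if (room == 0) {
                return TOY_MORE;  /* summary block full; caller retries */
            }
            if (p >= 0 && f->beingWritten[p]) {
                /* Parent was already placed; updating its index would wait forever. */
                return TOY_WOULD_DEADLOCK;
            }
            f->dirty[b] = 0;
            f->beingWritten[b] = 1;
            if (p >= 0) {
                f->dirty[p] = 1;  /* placing a child modifies its parent */
            }
            room--;
        }
    }
    return TOY_DONE;
}
```

Starting with only D1034 dirty and room for two blocks, the first call places D1034 and I-3 and reports that more work remains. If D1035 is dirtied before the second call, that call reports the deadlock case; if nothing is dirtied in between, the second call simply places I-2 and finishes.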
Mendel

Log-Number: 32524
Subject: random migd problems
Date: Sat, 20 Jun 92 14:06:15 PDT
From: Mike Kupfer <kupfer>

When I logged onto sage this afternoon, its local migd was hung. According to the global log, sage had tried a couple times this morning to become the global master and had failed with

MigPdev_OpenMaster: couldn't open "/sprite/admin/migd/pdev" (text file or pseudo-device busy)

Eventually lust (!) became the global master, but sage was apparently unable to talk to it. There were a bunch of messages in sage's migd log that said

ContactGlobal: couldn't open /sprite/admin/migd/pdev: invalid argument

I had to kill off and restart the migd on sage.

By the way, do we really want file servers acting as the migd global master?

mike

Log-Number: 32527
Date: Mon, 22 Jun 92 11:01:36 -0700
From: sullivan@postgres.Berkeley.EDU (Mark Sullivan)
Subject: recovery problem

My make file contains the following:

clrdb:
        /bin/rm -r /postdev/sullivan/data/base

I run "pmake clrdb" and pmake looks like it is executing, but the files don't go away. On the console (of arson), the following messages appear:

6/22/92 6:33:51 babylon (94) RmtPdev "/sprite/admin/migd/pdev" <917514,-917192620> : stale handle
6/22/92 6:33:51 babylon (94) - recovering handles
6/22/92 6:33:51 babylon (94) RmtPdev "/sprite/admin/migd/pdev" <917514,-917192620> Reopen failed : cacheable/busy conflict
6/22/92 6:33:51 babylon (94) Recovery failed: cacheable/busy conflict

Note that it is not 6:30am now, so the recovery problem was hours ago. If I run the rm locally, there is no problem. If I run pmake -X, there is no problem.

Mark

ps. The file system in which these files are stored was corrupted and regenerated from a backup yesterday. This could be part of the problem.
Log-Number: 32530 Subject: more on writeback problem during cleaning Date: Mon, 22 Jun 92 15:45:06 PDT From: Mike Kupfer <kupfer> Background: I ran into a problem some weeks ago where a compilation got migrated to covet and then the .o file couldn't get written back to the file server. Eventually the server timed out the writeback request, filling the file with zeroes. This eventually caused ld to choke. At the time covet couldn't do the writeback, the server was cleaning the filesystem that the .o file lived on. We thought that maybe the problem was that all the RPC channels on covet were busy, so I added some printf's to complain whenever a machine runs out of channels. Well, I just ran into the same writeback problem, this time between clove and lust. Lust was cleaning /user5. Clove did not report running out of RPC channels, so I think the problem is elsewhere. I'm not surprised by this, given how often clients get stuck because of cleaning (e.g., on /swap1). Here's an excerpt from lust's syslog: /user5: Cleaning started - deficit 44 segs /user5: Cleaned 44 segments in 18 segments /user5: Cleaning started - deficit 28 segs ConsistTimeout (1 minutes) client 57 write-back file <3,108638> "fsrmtDomain.o" Client state killed: 0 refs 0 write 0 exec FsrmtFileVerify: "fsrmtDomain.o" <3,108638> client 57 not found Fsrmt_RpcWrite, stale handle <3,108638> client 57 LE ethernet: Missed a packet. Fscache_BlockRead: Giving zeros to "fsrmtDomain.o" <3,108638> block 2 amount 174743 LE ethernet: Missed a packet. /user5: Cleaned 69 segments in 35 segments /user5: Cleaning started - deficit 21 segs /user5: Cleaned 95 segments in 52 segments /user5: Cleaning started - deficit 21 segs /user5: Cleaned 118 segments in 67 segments /user5: Cleaning started - deficit 15 segs Here's an excerpt from clove's syslog: 6/22/92 15:17:00 allspice (14) Client backing off again from negative ack. 
    RpcDoCall: <write> RPC to lust is hung
    <write> RPC ok
    6/22/92 15:24:17 lust (1) RmtFile "ds3100.md/fsrmtDomain.o" <3,108638> Write-back failed: stale handle <prefix>
    6/22/92 15:24:22 broadcast (0) RPC timed-out
    6/22/92 15:24:22 lust (1) - recovering handles

mike

[25-Jun-92: the next time this happens, the user should get the recovery debug log, e.g., using L1-y. -mdk]

Log-Number: 32531
Subject: can't boot 1.114 on clove
Date: Mon, 22 Jun 92 15:58:57 PDT
From: Mike Kupfer <kupfer>

It prints a half-dozen lines of what looks to be FDDI debugging information, followed by a complaint that the PARAM command didn't work, then it goes into the debugger. I left it in the debugger in case anyone wants to look at it.

mike

Log-Number: 32532
Date: Tue, 23 Jun 92 08:46:25 PDT
From: bmiller (Bob Miller)
Subject: check lw533

I'm not sure whether it's Sprite's or SHALLOT's problem...our printer, lw533, is hung. I've tried 'restart' through lpc with no luck. lpq shows:

    subversion.Berkeley.EDU: waiting for shallot to come up
    Rank Owner Job Files Total Size
    1st bmiller 717 /tmp/ES154161 6446 bytes
    connection to shallot is down: connection timed out

Can someone look into this as soon as possible? Thanks.

Bob

Log-Number: 32534
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 23 Jun 1992 10:54:24 PDT
Subject: Rpc module breaks Net locking

The RPC module sends an explicit acknowledgment down an idle channel when it is closed. This is done at interrupt level, and is the source of our "wrong server ID" messages. It turns out that it also breaks the net driver if the driver contains proper locking. Our current Ethernet drivers just turn off interrupts rather than use MASTER_LOCK and MASTER_UNLOCK. Geoff added locking to his FDDI driver, which causes a deadlock when a channel is closed because the lock is grabbed in the interrupt routine so that the subsequent call to Net_Output can't get it.
I plan on fixing the RPC module but in the meantime I'll push out an FDDI driver that doesn't use locks.

John

Log-Number: 32535
From: jhh@sprite.Berkeley.EDU (John H. Hartman)
Date: Tue, 23 Jun 1992 10:58:26 PDT
Subject: clarification on Net/RPC deadlock

My previous message is wrong concerning our current ethernet drivers. The Lance driver does use a master lock, only the interrupt handler does not grab it so the deadlock does not occur.

John

Log-Number: 32536
Date: Tue, 23 Jun 92 13:04:28 -0700
From: kupfer@dill (Mike Kupfer)
Subject: lust crash: address fault in Fsrmt_RpcRead

Right after the "packet too big" crash, lust died again. The console said

    Fsrmt_RpcRead, no handle <0, 116823> client 73
    Fsrmt_RpcRead, no handle <0, 2> client 73
    bad Vaddr = 0xce66d150

The PC was 0x800905b0 (running the 1.114 kernel).

John and I poked around a bit. The crash was in Fsrmt_RpcRead. It looked like larceny (client 73) had sent an RPC with a garbage parameter block.

    (kgdb) bt
    #0 0x800905b0 in Fsrmt_RpcRead (srvToken=(int *) 0xc04d03ac, clientID=73, command=8, storagePtr=(struct Rpc_Storage *)0xc80a3fa8) (fsrmtIO.c line 229)
    #1 0x800e71d4 in Rpc_Server () (rpcServer.c line 258)
    #2 0x800ec0c4 in Sched_StartKernProc (func=(void (*)()) 0x800e6e10 <Rpc_Server>) (schedule.c line 1014)
    #3 0x800ec03c in Sched_StartKernProc (func=(void (*)()) 0x1) (schedule.c line 984)
    (kgdb) print *paramsPtr
    $4 = {fileID = {type = -1068538984, serverID = -1068538984, major = -1070625732, minor = -1068538988}, streamID = {type = 73, serverID = -1, major = 41, minor = 0}, waiter = {links = {prevPtr = 0x18, nextPtr = 0xffffffff}, hostID = 1325407, pid = 1325423, waitToken = 1325399}, io = {buffer = 0x10b9b5 <Address 0x10b9b5 out of bounds>, length = 1202153, offset = 1325383, flags = 41, procID = 0, familyID = 20, uid = -1096116897, reserved = -1068538684}}

We put larceny into the debugger and rebooted lust, which promptly crashed again with a similar set of error messages, only with paprika as the guilty
client. This time lust was not debuggable.

    (kgdb) attach lust
    Attaching remote machine lust
    Remote debugging using lust
    Dumping system log ...
    Error reading memory address 0x3334332e: I/O error (5).

This was taken as an indication that maybe lust was having some network problems. John and Mary power cycled it and rebooted, and that seems to have fixed things.

mike

Log-Number: 32537
Date: Tue, 23 Jun 92 12:47:59 -0700
From: kupfer@dill (Mike Kupfer)
Subject: lust crash: output packet too big

Lust died with

    Fatal Error: OutputPacket: packet too large (4066)

It was not debuggable, so I rebooted.

mike

Log-Number: 32541
Date: Wed, 24 Jun 92 17:09:27 -0700
From: kupfer@dill (Mike Kupfer)
Subject: lust crash: address error

Lust died again with another addressing problem. There weren't any interesting looking error messages on the console. The stack backtrace was

    #0 0xc819bfb8 in ?? ()
    #1 0x8008d82c in Fsrmt_RpcClose (srvToken=(int *) 0xc053b7cc, clientID=11, command=10, storagePtr=(struct Rpc_Storage *) 0xc819bfa8) (fsrmtDomain.c line 719)
    #2 0x800e71d4 in Rpc_Server () (rpcServer.c line 258)
    #3 0x800ec0c4 in Sched_StartKernProc (func=(void (*)()) 0x800e6e10 <Rpc_Server>) (schedule.c line 1014)
    #4 0x800ec03c in Sched_StartKernProc (func=(void (*)()) 0xc0533980) (schedule.c line 984)

The storage passed to Fsrmt_RpcClose was

    (kgdb) print storage
    $1 = {requestParamPtr = 0xc053ca4c "\374\347j\366\374\347j\366\001", requestParamSize = 0, requestDataPtr = 0xc053ce4c "kupfer/Mail/context", requestDataSize = 0, replyParamPtr = 0xffffffff <Address 0xffffffff out of bounds>, replyParamSize = 0, replyDataPtr = 0xffffffff <Address 0xffffffff out of bounds>, replyDataSize = 0}

Note that the request parameter and data sizes are both 0.
The parameter block looked like

    $4 = {fileID = {type = -160765956, serverID = -160765956, major = 1, minor = 357}, streamID = {type = 1849, serverID = 1, major = 1, minor = 239752}, procID = 663886, flags = 33558533, closeData = {attrs = {firstByte = -1, lastByte = 49151, accessTime = 709239284, modifyTime = 0, createTime = 700009914, userType = 0, permissions = 493, uid = 891, gid = 155}}, closeDataSize = 36}

Note the bogus file type, which caused lust to jump off into hyperspace. Either paprika sent a bogus packet, or lust is having network (possibly hardware?) problems. We rebooted lust with 1.114.

mike

Log-Number: 32542
Subject: potential race during shutdown
Date: Wed, 24 Jun 92 17:12:39 PDT
From: Mike Kupfer <kupfer>

The LOCK_HANDLE macro that locks FS handles will bail out if the system is shutting down. This means that the caller could "lock" an already locked handle, and the original holder of the lock could unlock the handle, leaving the caller to operate on an unlocked handle. When the caller releases its reference to the handle, Fsutil_HandleReleaseHdr panics because it was told that the handle is already locked, but by inspection it knows that the handle isn't locked.

I've seen this happen once with the Sprite server, when an RPC server tried to close its current working directory while exiting. Do RPC servers in native Sprite have a non-nil current working directory?

mike

Log-Number: 32543
From: mgbaker (Mary Gray Baker)
Subject: Debug info in netroute
Date: Thu, 25 Jun 92 10:23:52 PDT

There is so much debug info being printed from netroute that it takes a year to boot a sparcstation. Is this debug info really necessary?

Mary
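
[Note on Log-Number 32542: the interleaving described there can be replayed with a minimal single-threaded C sketch. This is not the Sprite code. Handle, shuttingDown, LockHandle, UnlockHandle, ReleaseHandle and ReplayShutdownRace are hypothetical stand-ins invented for illustration; the real LOCK_HANDLE macro and Fsutil_HandleReleaseHdr are not reproduced here, and where the real release routine panics this model only returns false.]

```c
#include <stdbool.h>

/* Hypothetical stand-in for an FS handle header (not the Sprite struct). */
typedef struct {
    bool locked;
} Handle;

/* Hypothetical stand-in for the kernel's "system is shutting down" flag. */
static bool shuttingDown = false;

/*
 * Models a lock routine that bails out during shutdown.  Returns true if
 * the caller may proceed.  A real lock would wait for a held handle; this
 * single-threaded model reports failure instead.
 */
static bool LockHandle(Handle *h)
{
    if (shuttingDown) {
        return true;    /* bail out: caller proceeds without owning the lock */
    }
    if (h->locked) {
        return false;   /* already held by someone else */
    }
    h->locked = true;
    return true;
}

static void UnlockHandle(Handle *h)
{
    h->locked = false;
}

/*
 * Models the consistency check made on release: the caller claims to hold
 * the lock, and the routine verifies that by inspection.  Returns false
 * where the real code would panic.
 */
static bool ReleaseHandle(Handle *h, bool callerHoldsLock)
{
    if (callerHoldsLock && !h->locked) {
        return false;   /* told "locked", but the handle is not locked */
    }
    h->locked = false;
    return true;
}

/* Replays the reported interleaving; returns true if the check trips. */
static bool ReplayShutdownRace(void)
{
    Handle h = { false };

    shuttingDown = false;
    LockHandle(&h);         /* original holder takes the lock normally */
    shuttingDown = true;    /* shutdown begins */
    LockHandle(&h);         /* second caller "locks" an already locked handle */
    UnlockHandle(&h);       /* original holder unlocks */
    return !ReleaseHandle(&h, true);   /* second caller releases as if locked */
}
```

In this model ReplayShutdownRace() returns true, i.e. the release-time check trips. If shuttingDown stays false, the second LockHandle call fails instead (a real lock would wait), so the second caller never ends up operating on an unlocked handle.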